Cookbook: Matching and Extracting Strings with Regular Expressions

October 5, 2013

Regular expressions are very handy! Most of the time they are the most convenient way to extract or validate a String. Coming up with the correct expression is usually a bit tricky, but there are thousands of references out there on how to refine one for your criteria.

Personally, I recommend regexpal to test your regular expression before going into code. It is easy to use, clean, and intuitive.

Once the expression is worked out, these are the basic steps to match a String in Java and extract a specific part of it:

// create a pattern
Pattern pattProduct = Pattern.compile(regex, Pattern.DOTALL);

regex is the regular expression, given as a String.

Pattern.DOTALL means that the dot . in the regular expression matches any character, including line terminators; by default, it does not match them. More info in the Javadoc.
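To make the effect of the flag concrete, here is a minimal sketch. The class DotallDemo, the method matchesAcrossLines, and the pattern "start.*end" are made up for this illustration; only Pattern.compile, Pattern.DOTALL and Matcher.matches() come from the JDK.

```java
import java.util.regex.Pattern;

public class DotallDemo {

    // Hypothetical helper: true if the whole input matches "start.*end"
    // when the pattern is compiled with the given flags.
    static boolean matchesAcrossLines(String input, int flags) {
        return Pattern.compile("start.*end", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "start\nend";

        // Without DOTALL the dot stops at the line terminator.
        System.out.println(matchesAcrossLines(twoLines, 0));              // false

        // With DOTALL the dot also consumes the line terminator.
        System.out.println(matchesAcrossLines(twoLines, Pattern.DOTALL)); // true
    }
}
```

This matters when the content is a multi-line document, such as an HTML page, and the value to extract may span several lines.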

// create a matcher from the given pattern for the String content
Matcher matcher = pattProduct.matcher(content);

Basically, the Matcher is trying to match the pattern above with a given String content.

Then, one way to know if there is a match is the boolean find() method, which scans the content for the next subsequence that matches the pattern. Each subsequent call resumes where the previous match ended.

// find the first pattern match
if (!matcher.find()){
	System.out.println("The content does not match the given pattern.");
}
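Because every call to find() moves on to the next match, it can also drive a loop that collects all matches. The following is only a sketch: the class FindAllDemo, the helper findAll, and the sample input are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {

    // Hypothetical helper: collects every subsequence of content that matches regex.
    static List<String> findAll(String regex, String content) {
        Matcher matcher = Pattern.compile(regex).matcher(content);
        List<String> matches = new ArrayList<>();
        // find() returns false once no further match exists.
        while (matcher.find()) {
            // group() with no argument returns the whole current match.
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1 b22 c333")); // [1, 22, 333]
    }
}
```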

A more advanced usage is to extract a part of the text captured by that pattern. For instance, suppose we want to extract the value represented by #VALUE# in the following tag:

<tag style="text-align:left;">#VALUE#</tag>

The respective regular expression should be:

String regex = "<tag style=\"text-align:left;\">(.*?)</tag>";

So, we are interested in everything between the <tag style="text-align:left;"> and </tag> tags, which is captured by the group (.*?). The ? makes the quantifier reluctant, so the group stops at the first </tag> it finds.
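To see why the reluctant form is preferable here, compare it with the greedy (.*) on content that holds two tags. This is a small sketch under assumed names: ReluctantDemo, firstGroup, and the shortened <tag> markup are invented for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {

    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String content) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "<tag>A</tag><tag>B</tag>";

        // Reluctant: stops at the first closing tag.
        System.out.println(firstGroup("<tag>(.*?)</tag>", content)); // A

        // Greedy: runs to the last closing tag.
        System.out.println(firstGroup("<tag>(.*)</tag>", content));  // A</tag><tag>B
    }
}
```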

To extract this value, the Matcher provides the method group(int), which returns the text captured by the given group once find() has succeeded. So, given a content containing the String <tag style="text-align:left;">123456789</tag>:

System.out.println(matcher.groupCount());
System.out.println(matcher.group(1));

These lines print 1 and 123456789 respectively: the pattern has one capturing group, and group 1 holds the text between the tags.
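Putting the steps together, a self-contained version of the recipe could look like the sketch below. The class TagValueExtractor, the method extractValue, and the use of Optional for the no-match case are choices made for this illustration, not part of the original snippets.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagValueExtractor {

    // Compiled once and reused; a Pattern is immutable and safe to share.
    private static final Pattern TAG_PATTERN =
            Pattern.compile("<tag style=\"text-align:left;\">(.*?)</tag>", Pattern.DOTALL);

    // Hypothetical helper: returns the text inside the first matching tag, if any.
    static Optional<String> extractValue(String content) {
        Matcher matcher = TAG_PATTERN.matcher(content);
        // group(1) may only be read after find() has returned true.
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String content = "<tag style=\"text-align:left;\">123456789</tag>";
        System.out.println(extractValue(content).orElse("no match"));      // 123456789
        System.out.println(extractValue("<other/>").orElse("no match"));   // no match
    }
}
```

Returning an Optional is one way to avoid the IllegalStateException that group(1) throws when it is called without a successful match.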

In conclusion, regular expressions combined with Java's Pattern and Matcher classes can make matching and extracting Strings much easier.