Using pattern matcher to extract HTML

Question

I have a piece of HTML:

<div class="content" itemprop="softwareVersion"> 2.3  </div>

(This is the version of my app in the Play Store.) What I am trying to do is get the latest version using pattern matching.

What I have so far for matching the pattern is:

String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(htmlString);
matcher.find();

How do I now go about extracting 2.3 from htmlString?

This is not a useful answer, but just as a warning: http://stackoverflow.com/a/1732454/674108 — Jeff Burka, Oct 01 '15 at 19:17
In general, parse HTML with an HTML parser. Regardless of that, that expression won't match because you have spaces before the `<`. And you should familiarize yourself with [capturing groups](http://stackoverflow.com/q/16038206/4125191). — RealSkeptic, Oct 01 '15 at 19:23

Federico Piazza · Accepted Answer · 2015-10-01T20:11:14.337

Using the jsoup HTML parser

It's well known that you should not parse HTML with regex unless you know the limited set of HTML you are going to parse. You should use an HTML parser such as jsoup instead. You could use something like this:

 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;

 String htmlString = "YOUR HTML HERE";
 Document document = Jsoup.parse(htmlString);
 Element element = document.select("div[itemprop=softwareVersion]").first();
 System.out.println(element.text());

Regex approach

However, if you want to use regex, you have to use a capturing group and then grab its content.

String htmlString = "Some very long webpage string that includes the above tag";
// The parentheses form capturing group 1; \\s* tolerates the spaces around the version.
Pattern pattern = Pattern.compile("softwareVersion\">\\s*([^\\s<]*)\\s*</div");
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
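
For reference, the capturing-group approach can be sketched as one self-contained class. The names VersionExtractor and extractVersion are made up for illustration, and the pattern is a whitespace-tolerant variant that assumes the tag looks like the one in the question (optional whitespace around the version, closing </div>).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class VersionExtractor {
    // Group 1 captures the text between the optional whitespace after '>'
    // and the optional whitespace before the closing </div.
    private static final Pattern VERSION =
            Pattern.compile("softwareVersion\">\\s*([^\\s<]*)\\s*</div");

    // Returns the first captured version, or empty if the tag is not found.
    static Optional<String> extractVersion(String html) {
        Matcher matcher = VERSION.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
        System.out.println(extractVersion(html).orElse("no version found"));
    }
}
```

Returning an Optional makes the no-match case explicit, so the caller never calls group(1) on a matcher that did not match.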

score 0 · Answer 2 · answered Oct 01 '15 at 19:21

Try capturing it in a capture group:

Pattern.compile("softwareVersion\">\\s*([^\\s<]*)\\s*</div");

Then access the value with matcher.group(1).

FormerNcp

Jonathan · Answer 3 · 2015-10-01T19:36:36.037

I had to tweak a few things to make this work:

String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find())
{
    System.out.println(matcher.group(1));
}
// else: the pattern did not match; handle that case here

The () in the regex make it possible to use matcher.group(1).

Stephen P · Answer 4 · 2015-10-01T19:54:33.860

First, as comments point out, you can't parse HTML with a regex (thanks to Jeff Burka for linking to the canonical answer).

Second, since you are looking at a very limited and particular situation you can match using a capturing group to get the version.

Assuming that the div in question is not broken across lines, my strategy would be much like your posted attempt; look for the string softwareVersion and the tag close > character, optional whitespace, the version string, optional whitespace, and the closing tag.

That gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</

From debuggex (which needs the .* to match the leading part):

.*softwareVersion[^>]*>\s*([0-9.]+)\s*</

This will give you the version in a capturing group, which will be matcher.group(1)

As a Java string, that's softwareVersion[^>]*>\\s*([0-9.]+)\\s*</

I omitted the div after </ because, while it's in a div now, maybe it'll be a span or something else in the future.
I went simple with [0-9.] so it can match 2.3 but also 3.0.1, however it would also match ..382.1...33 — you could make one that matches a limited or arbitrary set of n(.n)* dotted numbers if it was important.

softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</ matches a version number n with zero to three .n point releases, so 3.0.2.1 but not 1.2.3.4.5
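
The difference between the loose and the strict pattern can be checked with a small sketch. The class and method names (VersionPatterns, firstGroup, tag) are illustrative only; the two regexes are the ones given above, and the test tag is assumed to look like the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares the two patterns from this answer.
class VersionPatterns {
    // Loose: any run of digits and dots.
    static final Pattern LOOSE =
            Pattern.compile("softwareVersion[^>]*>\\s*([0-9.]+)\\s*</");
    // Strict: a number n followed by zero to three ".n" groups.
    static final Pattern STRICT =
            Pattern.compile("softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Wraps a candidate version in the tag from the question.
    static String tag(String version) {
        return "<div class=\"content\" itemprop=\"softwareVersion\"> " + version + "  </div>";
    }

    public static void main(String[] args) {
        for (String v : new String[] {"2.3", "3.0.2.1", "1.2.3.4.5", "..382.1...33"}) {
            System.out.println(v + " -> loose: " + firstGroup(LOOSE, tag(v))
                    + ", strict: " + firstGroup(STRICT, tag(v)));
        }
    }
}
```

Note that the strict pattern as written requires the first number to start with 1-9, so a version such as 0.9 would not match it.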

score 0 · Answer 5 · answered Oct 01 '15 at 19:38

Try this regex: \"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>

It breaks down as follows:

\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the character > literally
\s matches any whitespace character [\r\n\t\f ]
1st capturing group ([0-9].?[0-9]?+)
    [0-9] matches a single character in the range between 0 and 9
    .? matches any character (except newline)
        Quantifier ?: between zero and one time, as many times as possible, giving back as needed [greedy]
    [0-9]?+ matches a single character in the range between 0 and 9
        Quantifier ?+: between zero and one time, as many times as possible, without giving back [possessive]
\s matches any whitespace character [\r\n\t\f ]
\s matches any whitespace character [\r\n\t\f ]
< matches the character < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)

https://regex101.com/r/kR7lC2/1
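
A quick way to see what this pattern accepts is to run it against a few inputs. The sketch below assumes the tag has exactly one whitespace character before the version and two after it, as in the question; the class name ShortVersionCheck is made up for illustration. Because the dot is unescaped it matches any character, and the group holds at most three characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative check of the pattern from this answer.
class ShortVersionCheck {
    // Same regex as above as a Java string literal (the \/ escape is not needed in Java).
    static final Pattern PATTERN =
            Pattern.compile("\"softwareVersion\">\\s([0-9].?[0-9]?+)\\s\\s</div>");

    // Returns group 1 of the first match, or null when nothing matches.
    static String capture(String html) {
        Matcher matcher = PATTERN.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String prefix = "<div class=\"content\" itemprop=\"softwareVersion\"> ";
        String suffix = "  </div>";
        System.out.println(capture(prefix + "2.3" + suffix));   // prints 2.3
        System.out.println(capture(prefix + "2x3" + suffix));   // prints 2x3, since . matches any character
        System.out.println(capture(prefix + "3.0.1" + suffix)); // prints null, too long for the group
    }
}
```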
