I have a piece of HTML:

<div class="content" itemprop="softwareVersion"> 2.3  </div> 

(This is the version of my app in the Play Store.) What I am trying to do is get the latest version using pattern matching.

What I have thus far for matching the pattern is:

String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\"> [^ <]*</dd");
Matcher matcher = pattern.matcher(htmlString);
matcher.find();

How do I now go about extracting 2.3 from the htmlString?

stoic
  • This is not a useful answer, but just as a warning: http://stackoverflow.com/a/1732454/674108 – Jeff Burka Oct 01 '15 at 19:17
  • Yeah, you should use a proper parser – Mulan Oct 01 '15 at 19:18
  • In general, parse HTML with an HTML parser. Regardless of that, that expression won't match because you have spaces before the `<`. And you should familiarize yourself with [capturing groups](http://stackoverflow.com/q/16038206/4125191). – RealSkeptic Oct 01 '15 at 19:23

5 Answers

Using the jsoup HTML parser

It's well known that you should not parse HTML with a regex unless you know exactly which limited set of HTML you are going to parse. You should use an HTML parser such as jsoup instead. You could use something like this:

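 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 String htmlString = "YOUR HTML HERE";
 Document document = Jsoup.parse(htmlString);
 // first() returns null when no element matches the selector
 Element element = document.select("div[itemprop=softwareVersion]").first();
 if (element != null) System.out.println(element.text());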

Regex approach

However, if you want to use a regex, you have to use a capturing group and then grab its content. The pattern also has to allow for the whitespace around the version and end with </div rather than </dd, otherwise it never matches the tag in the question.

String htmlString = "Some very long webpage string that includes the above tag";
Pattern pattern = Pattern.compile("softwareVersion\">\\s*([^ <]*)\\s*</div");
                                                  //     ^------^ Here
Matcher matcher = pattern.matcher(htmlString);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Federico Piazza

Try capturing it in a capturing group:

Pattern.compile("softwareVersion\"> ([^ <]*) *</div");

Then access the value with matcher.group(1) once matcher.find() has returned true.

FormerNcp

I had to tweak a few things to make this work:

String htmlString = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
Pattern pattern = Pattern.compile("softwareVersion\"> ([^ <]*) +</div");
Matcher matcher = pattern.matcher(htmlString);
if (matcher.find())
{
    System.out.println(matcher.group(1));
}
//else??

The () in the regex create a capturing group, which makes it possible to use matcher.group(1).
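The no-match case in the snippet above could be handled by wrapping the lookup in a small helper that returns an Optional. This is only a sketch: the class name VersionExtractor and the method extractVersion are made up for illustration, and the pattern is copied from the snippet above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionExtractor {
    // Same pattern as in the answer: group 1 is the text between "> " and the
    // run of spaces that precedes </div.
    private static final Pattern VERSION =
            Pattern.compile("softwareVersion\"> ([^ <]*) +</div");

    // Hypothetical helper: returns the captured version, or Optional.empty()
    // when the tag is not found in the given HTML.
    public static Optional<String> extractVersion(String html) {
        Matcher matcher = VERSION.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "String that includes <div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>";
        System.out.println(extractVersion(html).orElse("version not found"));
        System.out.println(extractVersion("<p>no version here</p>").orElse("version not found"));
    }
}
```

The caller then decides what a missing version means (a default, a retry, an error) instead of the helper guessing.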

Jonathan

First, as comments point out, you can't parse HTML with a regex (thanks to Jeff Burka for linking to the canonical answer).

Second, since you are looking at a very limited and particular situation you can match using a capturing group to get the version.

Assuming that the div in question is not broken across lines, my strategy would be much like your posted attempt; look for the string softwareVersion and the tag close > character, optional whitespace, the version string, optional whitespace, and the closing tag.

That gives a regex like softwareVersion[^>]*>\s*([0-9.]+)\s*</

From debuggex (which needs the .* to match the leading part):

.*softwareVersion[^>]*>\s*([0-9.]+)\s*</

This will give you the version in a capturing group, which will be matcher.group(1)

As a Java string, that's softwareVersion[^>]*>\\s*([0-9.]+)\\s*</


I omitted the div after </ because, while it's in a div now, maybe it'll be a span or something else in the future.
I went simple with [0-9.] so it can match 2.3 but also 3.0.1; however, it would also match ..382.1...33. You could write one that matches a limited or arbitrary number of n(.n)* dotted numbers if that were important.
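As a rough illustration of how the loose pattern behaves, the sketch below runs it against three inputs. The class name LooseVersionDemo, the helper find, and the span and ..382.1...33 sample strings are assumptions made up for this example; only the first sample comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LooseVersionDemo {
    // The loose pattern from this answer, written as a Java string.
    private static final Pattern LOOSE =
            Pattern.compile("softwareVersion[^>]*>\\s*([0-9.]+)\\s*</");

    // Hypothetical helper: returns capturing group 1, or null when nothing matches.
    public static String find(String html) {
        Matcher matcher = LOOSE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>",
            "<span itemprop=\"softwareVersion\">3.0.1</span>",
            "<div itemprop=\"softwareVersion\">..382.1...33</div>"
        };
        for (String sample : samples) {
            System.out.println(find(sample));
        }
    }
}
```

All three samples produce a match, including the malformed last one, which is the trade-off of keeping the character class this simple.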


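softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</ matches a version number n followed by zero to three .n point releases, so 3.0.2.1 but not 1.2.3.4.5. Because the first component must start with a digit from 1 to 9, a version such as 0.9 would not match either.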
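A quick way to check the stricter pattern is to try it against a few version strings. In this sketch the class name StrictVersionDemo, the helper find, and the surrounding div built around each version are assumptions for illustration; the pattern itself is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictVersionDemo {
    // The stricter pattern from this answer: n followed by zero to three ".n" parts.
    private static final Pattern STRICT = Pattern.compile(
            "softwareVersion[^>]*>\\s*([1-9][0-9]*(\\.[0-9]+){0,3})\\s*</");

    // Hypothetical helper: wraps the version in a sample div and returns
    // capturing group 1, or null when the pattern does not match.
    public static String find(String version) {
        String html = "<div itemprop=\"softwareVersion\"> " + version + " </div>";
        Matcher matcher = STRICT.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        for (String version : new String[] {"2.3", "3.0.2.1", "1.2.3.4.5", "0.9"}) {
            System.out.println(version + " -> " + find(version));
        }
    }
}
```

The first two inputs are captured whole. The last two are rejected: one has too many components, the other starts with 0.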

Stephen P

Try this regex: \"softwareVersion\">\s([0-9].?[0-9]?+)\s\s<\/div>

Here is what each part does:

\" matches the character " literally
softwareVersion matches the characters softwareVersion literally (case sensitive)
\" matches the character " literally
> matches the characters > literally
\s match any white space character [\r\n\t\f ]
1st Capturing group ([0-9].?[0-9]?+)
[0-9] match a single character present in the list below
0-9 a single character in the range between 0 and 9
.? matches any character (except newline)
Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
[0-9]?+ match a single character present in the list below
Quantifier: ?+ Between zero and one time, as many times as possible, without giving back [possessive]
0-9 a single character in the range between 0 and 9
\s match any white space character [\r\n\t\f ]
\s match any white space character [\r\n\t\f ]
< matches the characters < literally
\/ matches the character / literally
div> matches the characters div> literally (case sensitive)

https://regex101.com/r/kR7lC2/1
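For reference, here is one way the pattern might look as a Java string, with a few inputs to show where it does and does not match. The class name FixedShapeVersionDemo, the helper find, and every sample except the first are assumptions for illustration. The \/ escape is dropped because / needs no escaping in a Java regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedShapeVersionDemo {
    // The regex from this answer as a Java string literal: backslashes are
    // doubled and the literal quote is escaped for the Java compiler.
    private static final Pattern FIXED =
            Pattern.compile("\"softwareVersion\">\\s([0-9].?[0-9]?+)\\s\\s</div>");

    // Hypothetical helper: returns capturing group 1, or null when nothing matches.
    public static String find(String html) {
        Matcher matcher = FIXED.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches the tag exactly as shown in the question.
        System.out.println(find("<div class=\"content\" itemprop=\"softwareVersion\"> 2.3  </div>"));
        // A three-part version does not fit the digit, any character, digit shape.
        System.out.println(find("<div class=\"content\" itemprop=\"softwareVersion\"> 3.0.1  </div>"));
        // Only one space before </div>, but the pattern requires exactly two.
        System.out.println(find("<div class=\"content\" itemprop=\"softwareVersion\"> 2.3 </div>"));
        // The unescaped . accepts any character between the digits.
        System.out.println(find("<div class=\"content\" itemprop=\"softwareVersion\"> 2x3  </div>"));
    }
}
```

Only the first and last inputs match. The pattern is tied to the exact spacing and two-part shape of the current markup, and the unescaped dot lets a non-version string like 2x3 through.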

Rx Seven