Java RegExp - Extracting only numbers from a webpage

Question

I have a webpage converted to a string, and I'm trying to extract three numbers from this line of it:

<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>

From the line above, I can already extract the first '1' using this:

String filePattern = "<td class=\"col_stat\">(.+)</td>";
pattern = Pattern.compile(filePattern);
matcher = pattern.matcher(text);
if (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}

Now I want to extract the 0 and the last 1, but any time I try to edit the regular expression above, it just outputs the complete webpage on the console. Does anyone have any suggestions? Thanks

score 2 · Answer 1 · edited May 23 '17 at 12:27

Given that using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), I'd recommend doing this reliably with an HTML parser (e.g. JTidy: although it's an HTML pretty-printer, it also provides a DOM interface to the document).
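
As a rough sketch of the DOM approach, the example below uses the XML parser that ships with the JDK rather than JTidy, so it only works if the input is well-formed. Wrapping the cells in a <tr> element, and the class and method names (ColStatDom, colStats), are assumptions made for illustration; a real page would first need a lenient parser such as JTidy to produce a DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ColStatDom {

    // Returns the text of every <td class="col_stat"> in a well-formed fragment.
    static List<String> colStats(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList cells = doc.getElementsByTagName("td");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                Element cell = (Element) cells.item(i);
                if ("col_stat".equals(cell.getAttribute("class"))) {
                    values.add(cell.getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            // Malformed input ends up here; real-world HTML often will.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the row is wrapped in <tr> so the fragment has a single root.
        String row = "<tr>"
                + "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>"
                + "</tr>";
        System.out.println(colStats(row)); // prints [1, 0, 1]
    }
}
```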

answered Sep 04 '12 at 11:41

Brian Agnew


score 2 · Accepted Answer · edited Sep 04 '12 at 12:37

Regex quantifiers are greedy by default, so (.+) matches everything up to the last </td>. Try this instead, which looks only for digits, (\d+), rather than (.+):

String text = 
    "<td class=\"col_stat\">1</td>" + 
    "<td class=\"col_stat\">0</td>" + 
    "<td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
    String number = matcher.group(1);
    System.out.println(number);
}
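
To make the greedy behaviour concrete, here is a small sketch that runs both patterns over the same row and collects what group 1 captures. The class and helper names (GreedyDemo, captures) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {

    // Collects group 1 of every match of regex in text.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">0</td>" +
            "<td class=\"col_stat\">1</td>";

        // Greedy: a single match whose group runs up to the last </td>.
        System.out.println(captures("<td class=\"col_stat\">(.+)</td>", text));
        // prints [1</td><td class="col_stat">0</td><td class="col_stat">1]

        // Digits only: one match per cell.
        System.out.println(captures("<td class=\"col_stat\">(\\d+)</td>", text));
        // prints [1, 0, 1]
    }
}
```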

On a related note, I completely agree with others' suggestions to use a more structured approach to interpreting HTML.

score 1 · Answer 3 · answered Sep 04 '12 at 11:45

<td class=\"col_stat\">(.+)</td>

This regex is greedy. To make it match only numbers, change it to:

<td class=\"col_stat\">(\\d+?)</td>

I'd also suggest using XPath for this kind of matching; see Saxon and TagSoup.
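
A minimal sketch of the XPath idea, using the javax.xml.xpath API bundled with the JDK instead of Saxon, and assuming the input is already well-formed (which is the job TagSoup would do for real HTML). The <tr> wrapper and the names ColStatXPath and colStats are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ColStatXPath {

    // Evaluates an XPath query selecting every td whose class is col_stat.
    static List<String> colStats(String xhtml) {
        try {
            NodeList cells = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//td[@class='col_stat']",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression or for input that cannot be parsed.
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: a well-formed fragment with a single root element.
        String row = "<tr>"
                + "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>"
                + "</tr>";
        System.out.println(colStats(row)); // prints [1, 0, 1]
    }
}
```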

jdevelop


score 0 · Answer 4 · answered Sep 04 '12 at 11:49

This is because the quantifier in your pattern is greedy. You need a non-greedy (reluctant) quantifier to fix this.

String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}
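
One thing to keep in mind: the reluctant (.+?) captures whatever is in the cell, not only digits. The sketch below compares it with a digit-only (\d+) group on a row containing a non-numeric cell. The "n/a" cell is a hypothetical input that does not appear in the question, and the names ReluctantVsDigits and captures are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantVsDigits {

    // Collects group 1 of every match of regex in text.
    static List<String> captures(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row: the middle cell is not a number.
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">n/a</td>" +
            "<td class=\"col_stat\">12</td>";

        // Reluctant: stops at the first </td>, but accepts any content.
        System.out.println(captures("<td class=\"col_stat\">(.+?)</td>", text));
        // prints [1, n/a, 12]

        // Digits only: the non-numeric cell is skipped.
        System.out.println(captures("<td class=\"col_stat\">(\\d+)</td>", text));
        // prints [1, 12]
    }
}
```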

score 0 · Answer 5 · answered Sep 04 '12 at 11:50

Try this regular expression:

<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)

This does the following:

match your start string
capture a run of digits
skip any non-digits
capture a run of digits
skip any non-digits
capture a run of digits
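
The steps above could be used from Java roughly as follows. This is a sketch that assumes the row always holds exactly three numbers; the class and method names (ThreeGroups, firstThree) are made up for illustration. Note that the backslashes are doubled inside the Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreeGroups {

    // One match, three capture groups: the three numbers in order.
    private static final Pattern ROW = Pattern.compile(
            "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Returns the three numbers, or null if the text does not match.
    static String[] firstThree(String text) {
        Matcher m = ROW.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String text =
            "<td class=\"col_stat\">1</td>" +
            "<td class=\"col_stat\">0</td>" +
            "<td class=\"col_stat\">1</td>";
        String[] numbers = firstThree(text);
        if (numbers != null) {
            for (String n : numbers) {
                System.out.println(n); // prints 1, then 0, then 1
            }
        }
    }
}
```

Unlike the find() loop in the other answers, this needs only one match, but it breaks if the number of cells changes.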
