1

I have a webpage loaded into a string, and I'm trying to extract three numbers from this line of it:

<td class="col_stat">1</td><td class="col_stat">0</td><td class="col_stat">1</td>

I can already extract the first '1' from the line above using this:

String filePattern = "<td class=\"col_stat\">(.+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}

Now I want to extract the 0 and the last 1 as well, but whenever I try to edit the regular expression above, it just prints the complete webpage to the console. Does anyone have any suggestions? Thanks.

user602415

5 Answers

2

Using regexps on HTML/XML is a notorious gotcha (see here for the definitive answer), so I'd recommend doing this reliably with an HTML parser instead. JTidy is one option: although it's an HTML pretty-printer, it also provides a DOM interface to the document.

Community
Brian Agnew
2

Regex quantifiers are greedy by default, so (.+) matches everything up to the last </td>. Try matching only digits with (\d+) instead:

String text = 
    "<td class=\"col_stat\">1</td>" + 
    "<td class=\"col_stat\">0</td>" + 
    "<td class=\"col_stat\">1</td>";
String filePattern = "<td class=\"col_stat\">(\\d+)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
    String number = matcher.group(1);
    System.out.println(number);
}

On a related note, I completely agree with the others' suggestions to use a more structured approach to interpreting HTML.

Alan Moore
Vikdor
1
<td class=\"col_stat\">(.+)</td>

This regex is greedy. If you only need it to match numbers, change it to:

<td class=\"col_stat\">(\\d+?)</td>

That said, I'd rather suggest using XPath for this kind of matching; see Saxon and TagSoup.
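
For illustration, here is a minimal sketch of the XPath approach. It uses only the XML parser and XPath engine bundled with the JDK rather than Saxon or TagSoup, and it assumes the markup is well-formed XML, which real web pages often are not (that is where a tag-soup parser would come in). The class name `XPathCells`, the method `colStatValues`, and the wrapping `<tr>` element are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCells {
    // Returns the text of every <td class="col_stat"> cell, in document order.
    // Assumption: the input is well-formed XML with a single root element.
    static List<String> colStatValues(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList cells = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//td[@class='col_stat']", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath failure, e.g. because the markup is not well-formed.
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical row wrapper so that the fragment has a single root element.
        String row = "<tr>"
                + "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>"
                + "</tr>";
        System.out.println(colStatValues(row));
    }
}
```

The XPath expression selects cells by element name and attribute, so it does not depend on how the cells are laid out across lines in the source text.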

jdevelop
0

This is because the quantifier in your pattern is greedy: (.+) grabs as much as it can, up to the last </td>. You need a non-greedy (reluctant) quantifier, (.+?), to fix this.

String text = "<td class=\"col_stat\">1</td><td class=\"col_stat\">0</td><td class=\"col_stat\">1</td>";

String filePattern = "<td class=\"col_stat\">(.+?)</td>";
Pattern pattern = Pattern.compile(filePattern);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String number = matcher.group(1);
    System.out.println(number);
}
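
To make the difference visible, the following sketch runs the greedy and the reluctant pattern against the same input and collects what group 1 captures each time. The class name `GreedyVsReluctant` and the helper `captures` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the contents of capture group 1 for every match of regex in text.
    static List<String> captures(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";

        // Greedy: a single match whose group runs on to the last </td>.
        System.out.println(captures("<td class=\"col_stat\">(.+)</td>", text));

        // Reluctant: three matches, one per cell.
        System.out.println(captures("<td class=\"col_stat\">(.+?)</td>", text));
    }
}
```

With the greedy pattern the list holds one long string that still contains the inner tags; with the reluctant pattern it holds the three cell values.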
Marek Dec
0

Try this regular expression:

<td class="col_stat">(\d+)[^\d]+(\d+)[^\d]+(\d+)

This does the following:

  1. search for your start string
  2. capture a run of digits
  3. skip any non-digits
  4. capture a run of digits
  5. skip any non-digits
  6. capture a run of digits
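
In Java, the expression above could be used roughly like this. It is only a sketch: the class name `ThreeGroups` and the method `extract` are made up for the example, and it assumes the markup between the cells contains no digits of its own (a class name such as `col2` would be picked up as a number).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeGroups {
    // The expression from above, with backslashes and quotes escaped for a Java string literal.
    static final Pattern ROW = Pattern.compile(
            "<td class=\"col_stat\">(\\d+)[^\\d]+(\\d+)[^\\d]+(\\d+)");

    // Returns the three captured numbers, or an empty array if the text does not match.
    static String[] extract(String text) {
        Matcher matcher = ROW.matcher(text);
        if (!matcher.find()) {
            return new String[0];
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String text = "<td class=\"col_stat\">1</td>"
                + "<td class=\"col_stat\">0</td>"
                + "<td class=\"col_stat\">1</td>";
        System.out.println(Arrays.toString(extract(text)));
    }
}
```

Unlike the looping variants, this matches all three cells in one go, so it only fits rows that always have exactly three numbers.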
Philipp