
I am trying to parse data out of an HTML page using a Java regex, but have not had much luck. The data is dynamic and often includes zero or more spaces, tabs, and newlines. Also, depending on the number of hits, the structure of the string I'm parsing may change. Here is a sample in the cleanest format:

<div class="center">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>

However it can also look like this:

<div class="center">Showing 2343098 (search took 1.245 seconds)</div>

or

<div class="center">

  Showing            125 

 of 2,343,098 




(search took 1.245 seconds)</div>

What I'm trying to extract is the 2,343,098, but since the page is HTML I have to use either "Showing" or "(search took" as anchors to search between. The spaces, tabs, and newlines are tripping me up. I've been trying to use lookahead and lookbehind, but so far no luck. Here are a few patterns I've tried:

String pattern1 = "Showing [0-9]*\\S"; // not useful
String pattern2 = "[[\\d,+\\.?\\d+]*[\\s*\\n]\\(search took"; //fails
String pattern3 = "(/i)(Showing)(.+?)(\\(search took)"; //fails
String pattern4 = "([\\s\\S]*)\\(search took"; //fails
String pattern5 = "(?s)[\\d].*?(?=\\(search took)"; //close...but fails

Pattern pattern = Pattern.compile(pattern5);
Matcher matcher = pattern.matcher(text); // text = the string I'm parsing
while(matcher.find()) {
    System.out.println(matcher.group(0));
}
Firo
Pigasus
  • "I am trying to parse out data from a HTML page using a Java RegEx" not again http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Also do you know the difference between `[..]` and `(..)`? – Pshemo Jul 18 '14 at 15:02
  • *have not had much luck* that is why you should use a html parser – A4L Jul 18 '14 at 15:03
  • Wow, so honestly tell us how you feel about parsing html using regex? Let me re-phrase. "I HAVE A BIG LONG STRING THAT I NEED TO PARSE" and it contains a bunch of open and close carets! – Pigasus Jul 18 '14 at 15:06

2 Answers


HTML is not a regular language and cannot be parsed accurately with regular expressions. A regex-based solution is likely to break when the format of the markup changes in the future; a parser-based solution will be more robust.

However, if this is a one-off job, you can get away with the following regex:

Showing\s+(?:\d+\s+of\s+)?([\d,.]+)\s+\(search
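In Java the backslashes have to be doubled inside the string literal, and the number is read from capture group 1 rather than group 0. A minimal sketch of how the regex above might be applied to the three samples from the question; the class name `HitCountExtractor` and its helper methods are made up for illustration, and `toLong` assumes commas are the only thousands separators:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitCountExtractor {
    // Same regex as above, escaped for a Java string literal.
    // \s+ matches spaces, tabs and newlines, so no (?s) flag is needed.
    private static final Pattern TOTAL = Pattern.compile(
            "Showing\\s+(?:\\d+\\s+of\\s+)?([\\d,.]+)\\s+\\(search");

    // Returns the total as it appears in the text, e.g. "2,343,098".
    static Optional<String> extractTotal(String html) {
        Matcher m = TOTAL.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Assumption: commas are thousands separators and there is no decimal part.
    static long toLong(String total) {
        return Long.parseLong(total.replace(",", ""));
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div class=\"center\">Showing 25 of 2,343,098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">Showing 2343098 (search took 1.245 seconds)</div>",
            "<div class=\"center\">\n\n  Showing            125 \n\n of 2,343,098 \n\n\n\n\n(search took 1.245 seconds)</div>"
        };
        for (String sample : samples) {
            System.out.println(extractTotal(sample).orElse("no match"));
        }
    }
}
```

Because the "N of" part sits in an optional non-capturing group, group 1 always holds the total, whether or not the page shows a "Showing N of M" prefix.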


Amal Murali

The examples suggest the following pattern:

"Showing\\s+\\d+\\s+(of\\s+[\\d,.]+\\s+)?\\(search"
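As written, the pattern detects the line but its only group captures the whole "of N " fragment, and nothing when that part is absent. One possible variant, sketched here as an assumption rather than as part of the original pattern, adds capturing groups around both numbers and falls back to the first number when "of N" is missing. The class name `ShowingLineMatcher` and the method `total` are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShowingLineMatcher {
    // Variant of the pattern above (an assumption, not the original):
    // group 1 = the number right after "Showing", group 2 = the number after "of", if any.
    private static final Pattern LINE = Pattern.compile(
            "Showing\\s+(\\d+)\\s+(?:of\\s+([\\d,.]+)\\s+)?\\(search");

    // Returns the total hit count as text, or null when the line is not found.
    static String total(String html) {
        Matcher m = LINE.matcher(html);
        if (!m.find()) {
            return null;
        }
        // group(2) is null when the optional "of N" part did not participate in the match.
        return m.group(2) != null ? m.group(2) : m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(total("Showing 25 of 2,343,098 (search took 1.245 seconds)"));
        System.out.println(total("Showing 2343098 (search took 1.245 seconds)"));
    }
}
```

Note that group 1 uses `\\d+`, so this variant assumes the number directly after "Showing" never contains commas, as in the samples from the question.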
laune
  • It's not clear to me whether you want to detect "Showing ... (search..." even when the "of N" is absent. If not, simply remove the '?'. – laune Jul 18 '14 at 15:13