3

In Java:

Time ago I wrote a code that downloads a webpage and then parses it to find a specific value. I used a regexp like this and everything went fine.

Pattern p = Pattern.compile("<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>", Pattern.DOTALL);
Matcher m = p.matcher(page);
m.find();

Today the webpage changed, the string FOO is no longer present, and suddenly m.find() does not return anymore, blocking my whole application.

I then started to investigate and doing some debug, and found that with a normal html page (200kb, 3000 lines), if FOO is present the above regexp works fast, otherwise it takes hours.

Then I said.. Ok, probably the complexity of this expression justifies the long time waiting. But I wanted to verify my assumption, so I prepared some test in other languages and slightly modified the above pattern.

After I saved the webpage in a file and I modified it and inserted FOO where it was supposed to be. Then I wrote 4 tests:

  1. to match FOO with DOT_ALL
  2. to unmatch BAR with DOT_ALL
  3. to unmatch FOO without DOT_ALL
  4. to unmatch BAR without DOT_ALL

You can reach the test page here: http://pastebin.com/2S9fEpxD

In perl:

cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>FOO:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/s; print "$1\n";'

cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>BAR:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/s; print "$1\n";'

cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>FOO:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/; print "$1\n";'

cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>BAR:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/; print "$1\n";'

Test 1,2 and 4 instantly return. Test 3 takes 19 seconds to finish.

In PhP:

preg_match( '#<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>#s',file_get_contents('page.html'), $vals);

preg_match( '#<tr.*?>.*?<td.*?>BAR:</td>.*?<td.*?>(.*?)</td>.*?</tr>#s',file_get_contents('page.html'), $vals);

preg_match( '#<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>#',file_get_contents('page.html'), $vals);

preg_match( '#<tr.*?>.*?<td.*?>BAR:</td>.*?<td.*?>(.*?)</td>.*?</tr>#',file_get_contents('page.html'), $vals);

All 4 tests return instantly.

In Java, again:

Just to complete my test, I also performed test 3 and 4 in Java, and it took hours, just like test 2 (but not 1, that matches, and does it quickly)

This is the code I used (test 3 in this case):

FileReader fr = new FileReader("page.html");
char[] buff = new char[(int)new File("page.html").length()];
fr.read(buff);
fr.close();
String page = new String(buff);

Pattern p = Pattern.compile("<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>" /*, Pattern.DOTALL*/);
Matcher m = p.matcher(page);
System.out.println(m.find());

Conclusion

PhP behaves better than Perl, and extremely better than Java. Why? If php is able to tell quickly when this regexp match or not, why shouldn't the same technology be ported in Java? I always though that regular expression world was fully understood by human kind and that there were no other discoveries left to do.

Jack
  • 1,488
  • 11
  • 21
  • 2
    "I wrote a code that downloads a webpage and then parses it to find a specific value". Sorry, I stopped reading there. You should create a reproducible example demonstrating your result. – Tunaki Nov 21 '15 at 22:43
  • 7
    No wonder everyone posts a [link to this SO post](http://stackoverflow.com/a/1732454/3832970) when someone tries to parse HTML with regex. If you chose an HTML parser from the start, you'd have no problem at all. Any regex with `.*?` is doomed when large input is processed. – Wiktor Stribiżew Nov 21 '15 at 22:44
  • @Jack: Ok, and for your tests what is FOO and what is BAR? (where is located FOO more or less in your file (at the beginning, the middle, the end)?) – Casimir et Hippolyte Nov 21 '15 at 23:37
  • 1
    I tried your regex with one `".*?"` and it was quite fast, and then with two and it was slow, even for a found string (22 seconds). My guess is backtracking. Java isn't optimizing backtracking well here, so if there's something you can do to eliminate it it'll be faster. – markspace Nov 22 '15 at 00:01
  • @markspace: non-greedy quantifiers don't cause backtracking, only greedy quantifiers do. A non-greedy quantifier takes the characters one by one and must test if all the subpattern after fails to take a character (until the subpattern succeeds. The greedy quantifier takes all possible characters and after gives characters back until the subpattern succeeds (backtracking). – Casimir et Hippolyte Nov 22 '15 at 00:16
  • 3
    At the time I write this, three reviewers have voted to close this as too broad. I agree it is broad. However, I think the subject is worthwhile enough, and the post well-written enough, to justify leaving it open. – Tom Zych Nov 22 '15 at 00:58
  • @CasimiretHippolyte And are you sure that Java is implemented that way? Can you show me the code? Yes, I think if this is backtracking it's a bug, but it could still be a bug. – markspace Nov 22 '15 at 02:04
  • 1
    And I have to take that back slightly. `` does to backtrack because `.*?` could eat the starting literal of that pattern. Given the number of those patterns the OP employs, I think there could be something akin to catastrophic backtracking, even if there's no bugs in the Java implementation. – markspace Nov 22 '15 at 02:14
  • 2
    @TomZych, in a way I agree with you, but based on the presented details, it is not enough to get a good answer. There are missing details and many areas to address. It is interesting and worthwhile, but too broad. – Sully Nov 22 '15 at 06:39
  • 8
    1) The fact that you wrote an awful program is not Perl fault (You have the equivalent of 8 nested loops), 2) Perl's engine is better optimised for failed matches over successful matches, 3) Perl's engine has features the others don't (costing some speed), and 4) Perl's engine has far fewer bugs than the others (costing some speed). – ikegami Nov 22 '15 at 06:44
  • @HithamS.AlQadheeb what details are missing? – Jack Nov 22 '15 at 08:30
  • @ikegami sorry i'm not a perler, just copy the regexp match somewhere else, how could you rewrite test 3 and make it go faster? – Jack Nov 22 '15 at 08:30
  • @CasimiretHippolyte FOO is a string present in the page (line 2426), BAR is string not present – Jack Nov 22 '15 at 08:30
  • 1
    @jm666: I know what parsing is: [*to analyze (a string of characters) in order to associate groups of characters with the syntactic units of the underlying grammar.*](http://dictionary.reference.com/browse/parse) Any regex engine is *parsing* text using a given regex. I read that post, and I agree there can be valid ways of getting substrings from any text, even HTML code. However, the task of getting some node outer HTML can be safely solved with a proper HTML parser. You can use tongs to hammer a nail into a wall, but a hammer is more appropriate, isn't it? – Wiktor Stribiżew Nov 22 '15 at 09:02
  • @Jack, `.*?` can match far too much. It should be replaced with what you actually want to match to eliminate backtracking. – ikegami Nov 22 '15 at 17:03
  • All four Perl snippets return instantly for me. Are you using an old version of Perl? – ikegami Nov 22 '15 at 17:06
  • @ikegami it's v5.20.2. I'm not interested in optimizing my regexp, just like i'm not interested in using an html parser to find a solution to something. I was just wonder why there are so different implementation between java and perl/php. Does Java regexp routine need a revision? Can this be considered a bug? That was the point of my question – Jack Nov 22 '15 at 18:30
  • @AndreaTosoni I don't think it's a matter of language here; it looks like a matter of implementation to me. – Jack Nov 22 '15 at 18:33
  • 1
    That regexp returns instantly for that input with Perl 5.20.2 // you're the one who asked hope to change it – ikegami Nov 22 '15 at 23:26

0 Answers0