In Java:
Some time ago I wrote code that downloads a web page and then parses it to find a specific value. I used a regexp like this, and everything worked fine.
Pattern p = Pattern.compile("<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>", Pattern.DOTALL);
Matcher m = p.matcher(page);
m.find();
Today the web page changed: the string FOO is no longer present, and suddenly m.find() no longer returns, blocking my whole application.
I started investigating and debugging, and found that with a normal HTML page (200 KB, 3000 lines) the above regexp is fast if FOO is present; otherwise it takes hours.
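To make the fast/slow difference visible without the real page, here is a minimal sketch that counts how many characters the Java regex engine reads while matching. The Counting wrapper, the page generator and the synthetic table layout are assumptions made up for illustration; the real page is much larger and messier.

```java
import java.util.regex.Pattern;

public class BacktrackCount {
    /** CharSequence wrapper that counts how often the regex engine reads a character. */
    static final class Counting implements CharSequence {
        private final String text;
        long reads = 0;
        Counting(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int i) { reads++; return text.charAt(i); }
        @Override public CharSequence subSequence(int s, int e) { return text.subSequence(s, e); }
        @Override public String toString() { return text; }
    }

    /** Builds a synthetic table with two-cell rows; only the last row carries the label FOO. */
    static String page(int rows) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (int i = 0; i < rows; i++) {
            String label = (i == rows - 1) ? "FOO" : "K" + i;
            sb.append("<tr>\n<td>").append(label).append(":</td>\n<td>v")
              .append(i).append("</td>\n</tr>\n");
        }
        return sb.append("</table>\n").toString();
    }

    /** Runs find() with DOTALL for the given label and returns the number of character reads. */
    static long reads(String label, int rows) {
        Pattern p = Pattern.compile(
            "<tr.*?>.*?<td.*?>" + label + ":</td>.*?<td.*?>(.*?)</td>.*?</tr>", Pattern.DOTALL);
        Counting input = new Counting(page(rows));
        p.matcher(input).find();
        return input.reads;
    }

    public static void main(String[] args) {
        for (int rows : new int[] {10, 20, 40}) {
            System.out.println(rows + " rows: FOO=" + reads("FOO", rows)
                + " reads, BAR=" + reads("BAR", rows) + " reads");
        }
    }
}
```

On this kind of input the read count for BAR should grow much faster than the page size, while the count for FOO stays roughly proportional to it, which is consistent with what I see on the real page.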
My first thought was that the complexity of this expression probably justifies the long wait. But I wanted to verify that assumption, so I prepared some tests in other languages, using slightly modified versions of the above pattern.
First I saved the web page to a file and modified it, inserting FOO where it was supposed to be. Then I wrote 4 tests:
- Test 1: match FOO with DOTALL
- Test 2: fail to match BAR with DOTALL
- Test 3: fail to match FOO without DOTALL
- Test 4: fail to match BAR without DOTALL
You can reach the test page here: http://pastebin.com/2S9fEpxD
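The four tests can be sketched in Java as follows. The PAGE constant is a tiny hypothetical stand-in for the saved page (three rows, each cell on its own line), and the run helper is a name I made up for this sketch; on an input this small every test returns immediately, so it only shows what each test is expected to return, not how long it takes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FourTests {
    // Hypothetical stand-in for page.html: three rows, cells on separate lines.
    static final String PAGE =
        "<table>\n"
        + "<tr>\n<td>A:</td>\n<td>1</td>\n</tr>\n"
        + "<tr>\n<td>FOO:</td>\n<td>42</td>\n</tr>\n"
        + "<tr>\n<td>B:</td>\n<td>3</td>\n</tr>\n"
        + "</table>\n";

    /** Returns the captured value, or null when the pattern does not match. */
    static String run(String label, boolean dotAll) {
        String regex = "<tr.*?>.*?<td.*?>" + label + ":</td>.*?<td.*?>(.*?)</td>.*?</tr>";
        Pattern p = Pattern.compile(regex, dotAll ? Pattern.DOTALL : 0);
        Matcher m = p.matcher(PAGE);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("test 1 (FOO, DOTALL):    " + run("FOO", true));
        System.out.println("test 2 (BAR, DOTALL):    " + run("BAR", true));
        System.out.println("test 3 (FOO, no DOTALL): " + run("FOO", false));
        System.out.println("test 4 (BAR, no DOTALL): " + run("BAR", false));
    }
}
```

With this stand-in page, test 1 prints 42 and the other three print null. Test 3 fails only because the cells sit on separate lines, so without DOTALL the dot cannot cross the newlines between them.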
In Perl:
cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>FOO:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/s; print "$1\n";'
cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>BAR:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/s; print "$1\n";'
cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>FOO:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/; print "$1\n";'
cat page.html | perl -e '$str = do { local $/; <> }; $str =~ /<tr.*?>.*?<td.*?>BAR:<\/td>.*?<td.*?>(.*?)<\/td>.*?<\/tr>/; print "$1\n";'
Tests 1, 2 and 4 return instantly. Test 3 takes 19 seconds to finish.
In PHP:
preg_match( '#<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>#s',file_get_contents('page.html'), $vals);
preg_match( '#<tr.*?>.*?<td.*?>BAR:</td>.*?<td.*?>(.*?)</td>.*?</tr>#s',file_get_contents('page.html'), $vals);
preg_match( '#<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>#',file_get_contents('page.html'), $vals);
preg_match( '#<tr.*?>.*?<td.*?>BAR:</td>.*?<td.*?>(.*?)</td>.*?</tr>#',file_get_contents('page.html'), $vals);
All 4 tests return instantly.
In Java, again:
To complete the comparison, I also ran tests 3 and 4 in Java. Both took hours, just like test 2 (test 1, which matches, finishes quickly).
This is the code I used (test 3 in this case):
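import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {
    public static void main(String[] args) throws Exception {
        // Read the whole file at once; a single FileReader.read() call may return before the buffer is full.
        // The bytes are decoded with the platform default charset, as FileReader did.
        String page = new String(Files.readAllBytes(Paths.get("page.html")));
        // Test 3: FOO without DOTALL. Uncomment the flag to turn this into test 1.
        Pattern p = Pattern.compile("<tr.*?>.*?<td.*?>FOO:</td>.*?<td.*?>(.*?)</td>.*?</tr>" /*, Pattern.DOTALL*/);
        Matcher m = p.matcher(page);
        System.out.println(m.find());
    }
}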
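Because a run like this can take hours, one way to keep find() from blocking the whole application is to hand the matcher a CharSequence that gives up after a deadline. This is only a sketch under assumptions: DeadlineCharSequence, RegexTimeoutException, findWithTimeout and syntheticPage are hypothetical names invented here, and the synthetic page merely imitates the table structure of the real one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GuardedFind {
    /** Thrown from charAt() once the deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String msg) { super(msg); }
    }

    /** CharSequence view that aborts the match when a deadline is exceeded. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;
        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }
        @Override public char charAt(int index) {
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex matching exceeded its time budget");
            }
            return inner.charAt(index);
        }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }
        @Override public String toString() { return inner.toString(); }
    }

    /** Runs find(), but throws RegexTimeoutException after roughly timeoutMillis. */
    static boolean findWithTimeout(Pattern p, String page, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Matcher m = p.matcher(new DeadlineCharSequence(page, deadline));
        return m.find();
    }

    /** Synthetic table with two-cell rows; only the last row carries the label FOO. */
    static String syntheticPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (int i = 0; i < rows; i++) {
            String label = (i == rows - 1) ? "FOO" : "K" + i;
            sb.append("<tr>\n<td>").append(label).append(":</td>\n<td>v")
              .append(i).append("</td>\n</tr>\n");
        }
        return sb.append("</table>\n").toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(
            "<tr.*?>.*?<td.*?>BAR:</td>.*?<td.*?>(.*?)</td>.*?</tr>", Pattern.DOTALL);
        try {
            System.out.println(findWithTimeout(p, syntheticPage(300), 200));
        } catch (RegexTimeoutException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

Calling System.nanoTime() on every character read slows matching down noticeably; checking the clock only every few thousand reads would be cheaper. The guard only limits the damage and does not explain why the search is slow in the first place.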
Conclusion
PHP behaves better than Perl, and far better than Java. Why? If PHP can tell quickly whether this regexp matches or not, why can't the same technique be ported to Java? I always thought that regular expressions were fully understood and that there was nothing left to discover.