You should probably not be using regular expressions to parse HTML
- HTML is not a regular language, so no regular expression can describe it fully
- Regexes may match today, but what about tomorrow?
Say you've got a file of HTML and you're trying to extract the URLs from its <img> tags.
<img src="http://example.com/whatever.jpg">
So you write a regex like this (in Perl):
if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}
In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:
<img src='http://example.com/whatever.jpg'>
or
<img src=http://example.com/whatever.jpg>
or
<img border=0 src="http://example.com/whatever.jpg">
or
<img
src="http://example.com/whatever.jpg">
or you start getting false positives from
<!-- <img src="http://example.com/outdated.png"> -->
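To see how each of those variations fares, the short Perl script below runs the same regex against all six snippets. It is only a sketch: the extract_src helper and the sample labels are names made up here for illustration, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the regex from the text to one HTML
# snippet and returns the captured URL, or undef if nothing matches.
sub extract_src {
    my ($html) = @_;
    return $html =~ /<img src="(.+)"/ ? $1 : undef;
}

# The six snippets discussed above, each with an illustrative label.
my @samples = (
    [ 'double quotes'   => q{<img src="http://example.com/whatever.jpg">} ],
    [ 'single quotes'   => q{<img src='http://example.com/whatever.jpg'>} ],
    [ 'no quotes'       => q{<img src=http://example.com/whatever.jpg>} ],
    [ 'extra attribute' => q{<img border=0 src="http://example.com/whatever.jpg">} ],
    [ 'line break'      => qq{<img\nsrc="http://example.com/whatever.jpg">} ],
    [ 'inside comment'  => q{<!-- <img src="http://example.com/outdated.png"> -->} ],
);

for my $sample (@samples) {
    my ( $label, $html ) = @$sample;
    my $url = extract_src($html);
    printf "%-16s %s\n", $label, defined $url ? "matched: $url" : 'no match';
}
```

Run as written, it should report a match for only two inputs: the original double-quoted tag, and the commented-out tag that ought to have been ignored. The four other valid tags all come back as no match.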