I'm one of those people who think using regex in this situation is a bad idea.
Even if you just want to match an href attribute from an <a> tag, your regex will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.
Plus, matching href attributes from tags with an XML parser is anything but overkill.
I have been parsing HTML pages every week for at least 2 years now. At first, I used pure regex solutions, thinking they were easier and simpler than using an HTML parser.
But I had to go back to my code quite a lot, for many reasons:
- the HTML of the source had changed
- one of the source pages had broken HTML and I hadn't tested against it
- I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work
- ...
I found that fixing long regex patterns is not exactly fun; you have to wrap your head around them again and again.
What I usually do now is:
- Use tidy to clean the HTML source.
- Use DOM + XPath to actually parse the page and extract the parts I want.
- Use regexes only on small text-only parts (like the trimmed textContent of a node).
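The three steps above can be sketched roughly as follows. This is only an illustration: the sample markup, the "item" class and the price pattern are made up for the example, and the tidy step is skipped when the tidy extension is not installed.

```php
<?php
// Hypothetical input: sloppy markup with unclosed <p> tags and stray whitespace.
$html = '<html><body><div class="item"><p>  Price: 42 EUR </div>'
      . '<div class="item"><p>Price: 7 EUR</div></body></html>';

// Step 1: clean the markup with tidy, if the extension is available.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, ['wrap' => 0], 'utf8');
    if ($repaired !== false) {
        $html = $repaired;
    }
}

// Step 2: parse with DOM and select the nodes with XPath.
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//div[@class="item"]') as $node) {
    // Step 3: use a regex only on the trimmed text of a single node.
    $text = trim($node->textContent);
    if (preg_match('/^Price:\s*(\d+)\s*EUR$/', $text, $m)) {
        $prices[] = (int) $m[1];
    }
}
```

The regex here only ever sees a short string like "Price: 42 EUR", so it stays small and easy to fix when the source changes.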
The code is far more robust: I no longer spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources. It just feels right.
Now, even in cases where I'm not concerned about closing tags and the structure is pretty specific, I still use DOM-based solutions, both to keep improving my skills with DOM libraries and simply to produce better code.
I don't like seeing people here who just comment "Don't use regex on HTML" on every html+regex tagged question without providing sample code or something to start with.
Here is an example that matches href attributes from links in PHP, just to show that using an HTML parser for those common tasks isn't overkill at all.
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    // loop over every link
    foreach ($dom->getElementsByTagName('a') as $link) {
        // get the href attribute (an empty string if the link has none)
        $href = $link->getAttribute('href');
        // do whatever you want with it...
    }
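If you need this in more than one place, the same loop can be wrapped in a small helper. The function name extractHrefs and the sample markup below are hypothetical; the sketch also assumes you want to silence libxml warnings on broken markup and skip anchors that have no href.

```php
<?php
// Hypothetical helper: returns the href of every <a> that actually has one.
function extractHrefs(string $html): array
{
    // Collect libxml warnings internally so broken markup does not spam the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        if ($link->hasAttribute('href')) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Usage on deliberately sloppy markup: one anchor without href, one unclosed <a>.
$sample = '<p><a href="/one">One</a> <a name="anchor">no href</a> <a href="/two">Two</p>';
$hrefs = extractHrefs($sample);
```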
I hope this helps.