Acceptable use of Regex in HTML parsing?

Question

There is a lot of argument back and forth over whether, and when, it is ever appropriate to use a regex to parse HTML.

A common problem is parsing links out of HTML, so my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags, and you are looking for a pretty specific structure.

It seems like significant overkill to use a full HTML parser. I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect. Still, the extra constraints of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.

Thoughts?

Alohci · Accepted Answer · 2011-03-10T01:07:48.283

4

Consider this valid html:

<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>

What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
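
To make the comparison concrete, here is a minimal sketch that runs both approaches over the test case above. It assumes Python's standard-library html.parser as the parser and a deliberately naive href pattern as the regex; neither is named in the question, so treat both as stand-ins rather than as anyone's actual code.

```python
import re
from html.parser import HTMLParser

HTML = """<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
"""

# Hypothetical naive pattern: "href=", a quote, anything (lazy), a quote.
NAIVE_HREF = re.compile(r"""href=["'](.*?)["']""")


class HrefCollector(HTMLParser):
    """Record the href of every real <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


regex_hrefs = NAIVE_HREF.findall(HTML)

collector = HrefCollector()
collector.feed(HTML)
collector.close()
parser_hrefs = collector.hrefs

print(regex_hrefs)   # ['url1', 'url2', 'my']
print(parser_hrefs)  # ['my">url<']
```

The regex picks up the commented-out link, the text inside the class attribute, and a truncated value for the real link, while the parser reports only the one actual href. A more careful pattern can patch each case individually, but every patch is a step toward reimplementing the tokenizer.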

edited Mar 10 '11 at 01:07

answered Mar 10 '11 at 00:51

Alohci

You didn't even have to get nasty there with CDATA and its ilk to present a compelling reason not to use regexes on HTML. – Borealid Mar 10 '11 at 01:10

The HTML comment is a good example, but your wacky class is, I believe, invalid HTML. – Endophage Mar 10 '11 at 05:42

@Endophage - If you doubt my validity claim, it's easy to check it here: http://validator.w3.org/#validate_by_input . Just copy and paste my example in and click the "Check" button. – Alohci Mar 10 '11 at 09:00

@Alohci... interesting... I've had problems before with generated html that ended up having < or > in an attribute value – Endophage Mar 10 '11 at 17:24

Yann Milin · Answer 2 · 2011-03-10T11:59:50.550

I'm one of those people who think using regex in this situation is a bad idea.

Even if you just want to match the href attribute of an <a> tag, your regex will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.

Plus, matching href attributes from tags with an XML parser is anything but overkill.

I have been parsing HTML pages every week for at least 2 years now. At first I used pure regex solutions, thinking that was easier and simpler than using an HTML parser.

But I had to revisit my code quite a lot, for many reasons:

the source code had changed
one of the source pages had broken HTML and I hadn't tested against it
I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work
...

I found that fixing long regex patterns is not exactly fun; you have to wrap your mind around them again and again.

What I usually do now is:

Use tidy to clean the HTML source.
Use DOM + XPath to actually parse the page and extract the parts I want.
Use regexes only on small text-only parts (like the trimmed textContent of a node).
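
The last two steps can be sketched roughly as follows. This is only an illustration of the "parse first, regex on the extracted text" split: it uses Python's standard-library html.parser instead of tidy + DOM + XPath, and the LinkCollector class, the sample markup and the "in stock" pattern are all made up for the example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Store the trimmed text content of the link.
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html = """
<ul>
  <li><a href="/item/1"> Widget (12 in stock) </a></li>
  <li><a href="/item/2">Gadget <b>(3 in stock)</b></a></li>
  <li><a name="anchor-only">no href here</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(html)
collector.close()

# The regex only ever sees short, already-extracted text, never markup.
stock = {}
for href, text in collector.links:
    match = re.search(r"\((\d+) in stock\)", text)
    if match:
        stock[href] = int(match.group(1))

print(stock)  # {'/item/1': 12, '/item/2': 3}
```

The nested <b> tag and the anchor without an href are handled by the parser, so the pattern itself stays short enough to read at a glance.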

The code is far more robust, I don't have to spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources, and it just feels proper.

Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I'm still using DOM based solutions, to keep improving my skills with DOM libraries and just produce better code.

I don't like seeing people here who just comment "Don't use regex on HTML" on every html+regex tagged question without providing sample code or something to start with.

Here is an example that matches href attributes from links in PHP, just to show that using an HTML parser for those common tasks isn't overkill at all.

$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the markup to parse

// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}

I hope this helps.

Thanks for all the info. I've tried using PHP's DOM parser (I have no option to change from PHP) and for situations where I need to parse then output it's just too damn slow... It adds somewhere in the region of 4 seconds to a page load over a regex based solution. — Endophage, Mar 10 '11 at 17:28

score 0 · Answer 3 · edited May 23 '17 at 11:47

I proposed this one:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

On this thread.

It can still fail, depending on what appears in name.
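
As a rough check of where the pattern holds and where it breaks, here is a small sketch in Python. The only change is the named-group spelling, which Python writes as (?P<url>...); the two sample links are assumptions, the second one borrowed from the test case in the accepted answer.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
LINK = re.compile(r"""<a.*?href=["'](?P<url>.*?)["'].*?>(?P<name>.*?)</a>""")

simple = '<a href="https://example.com/">Example</a>'
tricky = """<a href='my">url<'>click</a>"""

m1 = LINK.search(simple)
m2 = LINK.search(tricky)

print(m1.group("url"), "|", m1.group("name"))  # https://example.com/ | Example
print(m2.group("url"), "|", m2.group("name"))  # my | url<'>click
```

The plain link comes out as expected. For the second one the value is cut at the first quote of either kind, and the rest of the attribute leaks into name, because the pattern does not require the closing quote to match the opening one.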

answered Mar 09 '11 at 22:51

M'vy

Read the question more carefully: "would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML?" I already have a regex that does it. I'm looking for whether people (who typically have a kneejerk reaction against using a regex with html) would consider this a legitimate use case where a regex is the appropriate solution. – Endophage Mar 10 '11 at 00:22
