0

There is a lot of argument back and forth over whether it is ever appropriate to use a regex to parse HTML.

A common problem that comes up is parsing links from HTML, so my question is: would using a regex be appropriate if all you were looking for was the href value of <a> tags in a block of HTML? In this scenario you are not concerned about closing tags and you have a pretty specific structure you are looking for.

It seems like significant overkill to use a full HTML parser. While I have seen questions and answers indicating that using a regex to parse URLs, while largely safe, is not perfect, the extra constraints of structured <a> tags would appear to provide a context where one should be able to achieve 100% accuracy without breaking a sweat.

Thoughts?

Endophage

3 Answers

4

Consider this valid html:

<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>

What is the list of urls to be extracted? A parser would say just a single url with value my">url<. Would your regular expression?
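
To make the comparison concrete, here is a minimal sketch in PHP that runs both approaches over the test case above. The regex is a deliberately naive illustration (an assumption, not a pattern anyone in this thread proposed), and `DOMDocument` from PHP's DOM extension stands in for "a parser"; other parsers may behave differently on edge cases.

```php
<?php
// The test case from above, as a nowdoc so nothing is interpolated.
$html = <<<'EOT'
<!DOCTYPE html>
<title>Test Case</title>
<p>
<!-- <a href="url1"> -->
<span class="><a href='url2'>"></span>
<a href='my">url<'>click</a>
</p>
EOT;

// Assumption: a naive "href inside an <a" pattern, for illustration only.
preg_match_all('/<a\s[^>]*?href=(["\'])(.*?)\1/i', $html, $m);
$regexUrls = $m[2];

// DOM-based extraction (libxml's HTML parser via DOMDocument).
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$domUrls = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $domUrls[] = $link->getAttribute('href');
}

print_r($regexUrls); // also picks up url1 (commented out) and url2 (inside an attribute value)
print_r($domUrls);   // only the href of the one real link
```

The regex has no notion of comments or attribute boundaries, so it reports three URLs where the document contains one link. Patching the pattern for each such case is possible, but that amounts to rebuilding a parser piece by piece.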

Alohci
  • You didn't even have to get nasty there with CDATA and its ilk to present a compelling reason not to use regexes on HTML. – Borealid Mar 10 '11 at 01:10
  • The HTML comment is a good example, but your wacky class is, I believe, invalid HTML. – Endophage Mar 10 '11 at 05:42
  • @Endophage - If you doubt my validity claim, it's easy to check it here: http://validator.w3.org/#validate_by_input . Just copy and paste my example in and click the "Check" button. – Alohci Mar 10 '11 at 09:00
  • @Alohci... interesting... I've had problems before with generated html that ended up having < or > in an attribute value – Endophage Mar 10 '11 at 17:24
2

I'm one of those people who think using regex in this situation is a bad idea.

Even if you just want to match the href attribute of an <a> tag, your regex will still run through the whole HTML document, which makes any regex-based solution cluttered, unsafe and bloated.

Plus, matching href attributes from tags with an XML parser is anything but overkill.

I have been parsing HTML pages every week for at least 2 years now. At first, I was using pure regex solutions, thinking they were easier and simpler than using an HTML parser.

But I had to come back to my code quite a lot, for many reasons:

  • the source code had changed
  • one of the source pages had broken HTML and I hadn't tested it
  • I hadn't tried my code on every page of the source, only to find out later that a few of them didn't work.
  • ...

I found that fixing long regex patterns is not exactly the most fun thing to do; you have to wrap your head around them again and again.

What I usually do from now on is:

  • Use tidy to clean the HTML source.
  • Use DOM + XPath to actually parse the page and extract the parts I want.
  • Use regexes only on small text-only parts (like the trimmed textContent of a node)
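
The last two steps of that workflow can be sketched roughly as follows. The input string, the `price` class and the price pattern are all hypothetical, chosen only to illustrate the split between structure (DOM + XPath) and text (regex); the tidy step is left out here because it depends on the tidy extension being installed.

```php
<?php
// Hypothetical input; in practice this would be the fetched (and tidied) page.
$html = '<div id="nav"><a href="/home">Home</a> <a name="top">no href</a></div>'
      . '<p>Price: <span class="price">  EUR 12.50 </span></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Structure is handled by the parser: only <a> elements that carry an href.
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

// The regex runs only on a small, text-only fragment.
$priceText = trim($xpath->query('//span[@class="price"]')->item(0)->textContent);
preg_match('/(\d+\.\d{2})/', $priceText, $m);
$price = $m[1];
```

Keeping the regex confined to `$priceText` means a change in the page's markup can only break the XPath expression, which tends to be much easier to read and fix than a long pattern.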

The code is far more robust, and I don't have to spend 2 hours on a long regex pattern to find out why it isn't working for 1% of the sources. It just feels proper.

Now, even in cases where I'm not concerned about closing tags and I have a pretty specific structure, I'm still using DOM-based solutions, to keep improving my skills with DOM libraries and just produce better code.

I don't like seeing people on here who just comment "Don't use regex on html" on every html+regex tagged question, without providing sample code or something to start with.

Here is an example that matches href attributes from links in PHP, just to show that using an HTML parser for those common tasks isn't overkill at all.

$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the page source as a string

// loop over every link
foreach ($dom->getElementsByTagName('a') as $link) {
    // get the href attribute
    $href = $link->getAttribute('href');
    // do whatever you want with it...
}

I hope this helps somehow.

Yann Milin
  • Thanks for all the info. I've tried using PHP's DOM parser (I have no option to change from PHP) and for situations where I need to parse then output it's just too damn slow... It adds somewhere in the region of 4 seconds to a page load over a regex based solution. – Endophage Mar 10 '11 at 17:28
0

I proposed this one:

<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>

On this thread

It can possibly fail, depending on what appears in name.
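
As a rough illustration, here is the pattern applied in PHP with `preg_match`. The `~` delimiters and all three sample inputs are assumptions made for this sketch, not taken from the linked thread.

```php
<?php
// The pattern from this answer, wrapped in ~ delimiters for PCRE.
$pattern = '~<a.*?href=["\'](?<url>.*?)["\'].*?>(?<name>.*?)</a>~';

// A simple link: both named groups come out as expected.
preg_match($pattern, '<a href="http://example.com/page">Example</a>', $ok);

// Markup inside the link text ends up in the name group as-is.
preg_match($pattern, '<a href="/x"><b>Bold</b></a>', $nested);

// An apostrophe inside a double-quoted href: ["\'] accepts either quote
// as the terminator, so the captured URL is cut short.
preg_match($pattern, '<a href="/o\'brien">O\'Brien</a>', $bad);

var_dump($ok['url'], $ok['name']); // the full URL and "Example"
var_dump($nested['name']);         // "<b>Bold</b>", tags included
var_dump($bad['url']);             // "/o" instead of "/o'brien"
```

So besides the name group picking up nested tags, the url group can also be truncated whenever the value contains the other kind of quote.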

M'vy