How do I match a specific malformed link in HTML without the regex spanning multiple links?

The exact link I am trying to match is:

href="http://www.hotmail.com' rel='external nofollow"

Pay particular attention to the mismatching of ' and " in the above.

What I have tried:

if(preg_match('|href="http(.*?)\' rel=\'(.*?)"|i', $html)){
  echo "Found bad html\n";
}

However, that regexp also matches across several links in perfectly good HTML. I need it to match only within a single link.

Jaime Cross
  • 5
    You don't use regex to parse HTML – TJHeuvel Apr 21 '11 at 08:32
  • @TJHeuval - I reckon SO gets several "how to parse HTML with regex" questions every day. – Richard H Apr 21 '11 at 08:35
  • You should adapt your HTML anyway. The way you mix `"` and `'` might also make some browsers trip over that. – mario Apr 21 '11 at 08:36
  • @mario: it is not my HTML. I am trying to fix other people's HTML on the fly before it gets displayed on my site. – Jaime Cross Apr 21 '11 at 08:37
  • 2
    To just fix broken HTML, use Tidy. To parse it, see [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662). – Gordon Apr 21 '11 at 08:37
  • @Jaime Cross - no you can't. Please see the _thousands_ of other questions on this exact issue. http://stackoverflow.com/questions/tagged/regex%2bhtml?sort=votes&pagesize=15 – Richard H Apr 21 '11 at 08:38
  • @Jaime, okay you *can* but you **shouldn't**. Checkout http://uk2.php.net/domdocument. – Nick Apr 21 '11 at 08:38
  • @gordon: I considered that but tidy adds a lot for a single simple swap like I need. – Jaime Cross Apr 21 '11 at 08:40
  • 2
    @Richard you **can** parse HTML with Regex. The often cited rant on SO is wrong. It is just not practical to use Regex on HTML the more arbitrary the HTML becomes. And that's why you should use a parser instead, because in those cases it's more reliable. – Gordon Apr 21 '11 at 08:40
  • @Richard: what do you think people used before domdoc and xpath came out exactly? – Jaime Cross Apr 21 '11 at 08:41
  • 1
    @Jaime it adds a function call. Unless you benchmarked that function call and it has significant negative impact, I'd say it adds nothing worth to mention – Gordon Apr 21 '11 at 08:42
  • @Nick I am using domdocument, however that breaks on this particular links bad html, which is why I am using regex to filter it. – Jaime Cross Apr 21 '11 at 08:42
  • @Gordon: of course you can parse very specific cases. But even these tend to be fiddly to get right, and if anything changes you're back to the drawing board. It's so much simpler, more robust, and more extensible to use a parser that the rule should be "Never use a regex to parse HTML"! You'll save yourself a huge amount of time and pain, and learn a really powerful tool in the process. – Richard H Apr 21 '11 at 08:47
  • 1
    Some relevant reading for all those involved is [Parsing HTML the Cthulhu way](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – Diarmaid Apr 21 '11 at 08:47
  • @Diarmaid - thank you, I was looking for this but couldn't find it :) – Richard H Apr 21 '11 at 08:48
  • @Richard - I only originally found it on yesterdays HTML/Regex question ;) – Diarmaid Apr 21 '11 at 08:49
  • 1
    @Richard don't get me wrong. I'm all for using a parser. I just don't agree that Regex **cannot** be used on HTML, because that's wrong. And in this specific case, Regex might be the only practical way to solve the problem at hand. – Gordon Apr 21 '11 at 08:54
  • @Richard: Gordon has it right. I am using DomDocument, however that fails on this specific piece of bad html. Hence my need to use the regexp to filter that out. If they change that html and the regex breaks somewhere down the road, fantastic, it might mean they fixed their html :) – Jaime Cross Apr 21 '11 at 08:56
  • 1
    @Jaime just for the record, it doesnt seem to be possible with Tidy either, so please disregard that suggestion of mine earlier on. – Gordon Apr 21 '11 at 09:06
  • @Gordon thanks for the heads up, marios solution worked for my needs, so I am good. :) – Jaime Cross Apr 22 '11 at 02:26

1 Answer

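You might be able to adapt your regex by replacing the generic .*? with a negated character class like [^<"'>]+. Because that class cannot match a quote or an angle bracket, the match cannot run past the end of an attribute value or a tag, which usually keeps it from consuming too much.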

if(preg_match('| href="(http[^<"\'>]+)\' rel=\'([^<"\'>]+)"|i', $html)){
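To see the difference, here is a minimal sketch that runs the original lazy pattern and the negated-class pattern against the same input. The sample markup in `$good` and the variable names are made up for illustration; only the two patterns and the malformed link come from this page.

```php
<?php
// Hypothetical sample: three well-formed links, each using one quote style consistently.
$good = '<a href="http://a.example/">A</a> '
      . '<a href=\'http://b.example/\' rel=\'nofollow\'>B</a> '
      . '<a href="http://c.example/">C</a>';

// The malformed link from the question (opens with " and closes with ').
$bad = '<a href="http://www.hotmail.com\' rel=\'external nofollow">C</a>';

// Original pattern from the question: the lazy .*? may still cross tag boundaries.
$lazy = '|href="http(.*?)\' rel=\'(.*?)"|i';

// Pattern from this answer: the negated class stops at any quote or angle bracket.
$negated = '| href="(http[^<"\'>]+)\' rel=\'([^<"\'>]+)"|i';

var_dump(preg_match($lazy, $good));    // int(1): false positive spanning all three links
var_dump(preg_match($negated, $good)); // int(0): well-formed links are left alone
var_dump(preg_match($negated, $bad));  // int(1): the malformed link is detected
```

The lazy quantifier only guarantees the shortest match from a given starting point. It does not prevent that match from starting in one link and ending in another.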

Better yet: don't hard-code the " and ', but use a character class to match them too:

if(preg_match('| href=["\']http([^<"\'>]+)["\']'
              .' rel=["\']([^<"\'>]*)["\']|i', $html)){

(Oh, now it looks really ugly.)
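If the goal is to repair the markup rather than just detect it, one possible sketch is to feed the stricter first pattern to preg_replace and rewrite the quotes. The function name `fix_mismatched_link_quotes` is hypothetical and not part of the original answer. The stricter pattern is used here on the assumption that only the exact `"...' rel='..."` mismatch should be touched, since the quote-agnostic variant would also match links that are already well formed.

```php
<?php
// Hypothetical helper (not from the original answer): rewrites
//   href="URL' rel='VALUE"   into   href="URL" rel="VALUE"
// and leaves everything else untouched.
function fix_mismatched_link_quotes(string $html): string
{
    $fixed = preg_replace(
        '| href="(http[^<"\'>]+)\' rel=\'([^<"\'>]+)"|i',
        ' href="$1" rel="$2"',
        $html
    );

    // preg_replace returns null on a regex error; fall back to the input in that case.
    return $fixed ?? $html;
}

$bad = '<a href="http://www.hotmail.com\' rel=\'external nofollow">Hotmail</a>';
echo fix_mismatched_link_quotes($bad), "\n";
// prints: <a href="http://www.hotmail.com" rel="external nofollow">Hotmail</a>
```

The repaired string can then be handed to DOMDocument or another parser for the rest of the processing.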

mario