I'm trying to write a RegEx which finds all links on a webpage with the rel="nofollow" attribute. Mind you, I'm a RegEx newb, so please don't be too harsh on me :)

This is what I got so far:

$link = "/<a href=\"([^\"]*)\" rel=\"nofollow\">(.*)<\/a>/iU";

Obviously this is very flawed. Any link that has other attributes, or that is written a little differently (for example, with single quotes), won't be matched.

Linkjuice57
[Don't. Use. Regex. To. Parse. HTML.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) ... the Pony, he comes. – Feb 27 '12 at 20:55

2 Answers

You should really use a DOM parser for this purpose, as any regex-based solution will be error-prone for this kind of HTML parsing. Consider code like this:

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about malformed HTML
$doc->loadHTML($html); // $html holds the page markup
$xpath = new DOMXPath($doc);
// returns a DOMNodeList of all links whose rel attribute is exactly "nofollow"
$nlist = $xpath->query("//a[@rel='nofollow']");
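
The query above only matches links whose rel value is exactly "nofollow". If a page uses several values, such as rel="external nofollow", the XPath can be widened to treat rel as a space-separated list. The following is a minimal sketch of that idea, not a definitive implementation: the sample markup and the `$links` array are made up for illustration, and the match stays case-sensitive.

```php
<?php
// Hypothetical sample markup, only for illustration.
$html = <<<HTML
<p>
  <a href="http://example.com/a" rel="nofollow">A</a>
  <a href='http://example.com/b' class="x" rel='external nofollow'>B</a>
  <a href="http://example.com/c">C</a>
</p>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "nofollow" as one whitespace-separated token of rel,
// so rel="external nofollow" is found as well.
$query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' nofollow ')]";

// Collect the URL and the visible text of every matching link.
$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

Because the parser normalises quoting and attribute order, the single-quoted link with an extra class attribute is picked up without any special handling.
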
anubhava

Try this:

$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";
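
To show how this pattern could be applied, here is a minimal sketch using `preg_match_all`. The sample markup, the `$found` array and the second href-extracting regex are assumptions added for illustration; they are not part of the answer's pattern.

```php
<?php
// Hypothetical sample markup, only for illustration.
$html = '<a href="/a" rel="nofollow">A</a> '
      . "<a class='x' rel = 'nofollow' href='/b'>B</a> "
      . '<a href="/c">C</a>';

// Pattern from the answer above, unchanged.
$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";

// PREG_SET_ORDER gives one array per matched link.
preg_match_all($link, $html, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    // $m[0] is the whole element, $m[3] the link text.
    // Pull the href out of the matched element in a second step.
    $href = preg_match('/href\s*=\s*([\'"])(.*?)\1/i', $m[0], $h) ? $h[2] : null;
    $found[] = ['href' => $href, 'text' => $m[3]];
}

print_r($found);
```

The pattern copes with single or double quotes, extra attributes and spaces around the equals sign. It still misses an unquoted rel=nofollow, a rel with several values such as "external nofollow", and link text that spans multiple lines (that would need the `s` modifier), which is why a DOM parser is the more robust route.
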
Paulo Rodrigues