
I'm trying to use preg_match_all to scan the source of a page and pull all links that are mailto: links into one array and all links that are not mailto: links into another array. Currently I'm using:

$searches = array('reg'=>'/href(=|=\'|=\")(?!mailto)(.+)\"/i','mailto'=>'/href(=|=\'|=\")(?=mailto)(.+)\"/i');
foreach ($searches as $key=>$search)
{
    preg_match_all($search,$source,$found[$key]);
}

The mailto: search works perfectly, but I can't work out why the non-mailto: search returns both mailto: and non-mailto: links, even with the negative lookahead assertion in place. What am I doing wrong?

mtylerb

[**The pony, he comes**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454): the canonical reference for why regex parsing of [X]HTML is a bad idea. – Feb 12 '12 at 22:53

2 Answers


A saner solution that isn't so fragile would be to use DOMDocument...

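$dom = new DOMDocument;
// $html holds the page source ($source in the question).
$dom->loadHTML($html);

$mailLinks = $nonMailLinks = array();

// Sort every anchor's href into one of the two arrays.
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $href = trim($anchor->getAttribute('href'));
        // Case-insensitive prefix check, matching the /i flag in the question.
        if (strncasecmp($href, 'mailto:', 7) === 0) {
            $mailLinks[] = $href;
        } else {
            $nonMailLinks[] = $href;
        }
    }
}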
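
As a variant of the same idea, the anchors can be selected with DOMXPath so that only elements carrying an href are visited. This is a minimal, self-contained sketch rather than part of the original answer: the sample markup in $html is hypothetical, and the XPath expression //a[@href] is one possible way to filter.

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<p><a href="mailto:me@example.com">Mail</a> '
      . '<a href="http://example.com/">Site</a> '
      . '<a name="top">No href</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$mailLinks = $nonMailLinks = array();

// //a[@href] selects only anchors that actually have an href attribute.
foreach ($xpath->query('//a[@href]') as $anchor) {
    $href = trim($anchor->getAttribute('href'));
    if (strncasecmp($href, 'mailto:', 7) === 0) {
        $mailLinks[] = $href;
    } else {
        $nonMailLinks[] = $href;
    }
}

print_r($mailLinks);
print_r($nonMailLinks);
```

The named anchor without an href is skipped by the query itself, so the loop body no longer needs the hasAttribute() check.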


alex

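Your regex tries the alternatives from left to right here, so the bare = matches first and the quote is left for .+ to consume:

 (=|=\'|=\")

You could sort that = last, or use the more common:

 =[\'\"]?

On its own that is not enough, though: whenever the lookahead fails after the quote, the engine backtracks to the match without the quote and tries again. So also exchange the .+ for the more explicit/restrictive [^\'\">]+. Otherwise the negative assertion is tested against '"mailto:' (quote included), which does not start with mailto, so it never fails and .+ matches the rest.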
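
The behaviour can be reproduced with a short script. This is only a sketch: the two-link $source string is made up for illustration, and $fixed is one way of combining the two suggestions above (optional quote plus the restrictive character class), not a pattern taken verbatim from the question.

```php
<?php
// Hypothetical sample input; one link per line so that .+ stops at the newline.
$source = '<a href="mailto:me@example.com">Mail</a>' . "\n"
        . '<a href="http://example.com/">Site</a>';

// Pattern from the question: the bare "=" alternative matches first,
// so the lookahead is tested against "mailto (quote included) and passes.
$original = '/href(=|=\'|=\")(?!mailto)(.+)\"/i';

// Combined suggestion: optional quote, then a class that cannot consume a quote.
// If the engine backtracks to the quote-less match, [^'">]+ fails on the quote.
$fixed = '/href=[\'\"]?(?!mailto)([^\'\">]+)/i';

preg_match_all($original, $source, $before);
preg_match_all($fixed, $source, $after);

print_r($before[2]); // both links, each with its leading quote
print_r($after[1]);  // only the non-mailto link
```

Note that $fixed still assumes well-formed attribute values; markup with unusual spacing or unquoted values containing special characters is where a DOM parser becomes the safer choice.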

mario