To find all links in HTML, which is the right way: regex or DOM parsing?

Question

I want to extract all the href links from an HTML document. I came across two possible approaches. One uses a regex:

$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[2];          // link address
        echo $match[3] . "<br>"; // link text
    }
}

The other creates a DOM document and parses it:

$html = urldecode(base64_decode($html_file));
// Create a new DOM document
$dom = new DOMDocument;

// Parse the HTML. The @ is used to suppress any parsing errors
// that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);

// Get all links. You could also use any other tag name here,
// like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

// Iterate over the extracted links and display their text and URLs
foreach ($links as $link) {
    // Show the link text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}

I don't know which of these is more efficient. The code will be run many times, so I want to know which approach is the better one to go with. Thank you!

Surely a benchmark test would be the best idea here? That will tell you which is quicker. — Tom, Jan 11 '17 at 12:05
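A benchmark along these lines could be a starting point. This is only a rough sketch: `averageSeconds` is a hypothetical helper, the sample HTML is made up, and the regex is a simplified `href` pattern rather than the exact expression from the question. It assumes PHP 7.3 or later for `hrtime()`. Real numbers will depend heavily on the size and shape of the actual pages.

```php
<?php
// Hypothetical helper: average seconds per call of $fn over $iterations runs.
function averageSeconds(callable $fn, int $iterations = 1000): float
{
    $start = hrtime(true);               // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9 / $iterations;
}

// Made-up sample input; a real benchmark should use representative pages.
$html = str_repeat('<p><a href="page.html">Page</a></p>', 200);

// Approach 1: simplified regex (an assumption, not the question's pattern).
$regexTime = averageSeconds(function () use ($html) {
    preg_match_all('/href\s*=\s*"([^"]*)"/i', $html, $matches);
});

// Approach 2: DOM parsing, with libxml warnings collected instead of using @.
$domTime = averageSeconds(function () use ($html) {
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach ($dom->getElementsByTagName('a') as $link) {
        $link->getAttribute('href');
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
});

printf("regex: %.6f s/call\nDOM:   %.6f s/call\n", $regexTime, $domTime);
```

Speed is only one axis, though. A faster approach that returns wrong links is not a win, so it is worth checking correctness on real input as well.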
Undoubtedly, you shouldn't parse HTML with regular expressions! Use DOM or SimpleXML for relatively small documents, and SAX/pull parsers for large documents, e.g. XML parser, XMLReader — Ruslan Osmanov, Jan 11 '17 at 12:06
I'm voting to close this question as off-topic because the question should be asked on http://codereview.stackexchange.com/ — Ruslan Osmanov, Jan 11 '17 at 12:07
How would your regex handle a webpage about HTML describing the `href` attribute with examples? That would contain lots of "href=" occurrences which are not links but content. — piet.t, Jan 11 '17 at 12:11
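The case piet.t describes can be reproduced with a small sketch. The HTML string below is invented for illustration, and the regex is a simplified `href` pattern, not the exact expression from the question. The first paragraph only talks about a link (the markup is escaped text inside `<code>`), while the second contains a real anchor.

```php
<?php
// Invented sample: one escaped example of a link, one real link.
$html = '<p>Example markup: <code>&lt;a href="demo.html"&gt;demo&lt;/a&gt;</code></p>'
      . '<p><a href="real.html">Real link</a></p>';

// Regex approach (simplified pattern): matches any href="..." in the raw text.
preg_match_all('/href\s*=\s*"([^"]*)"/i', $html, $matches);
$regexLinks = $matches[1];

// DOM approach: only actual <a> elements are considered.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$domLinks = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $domLinks[] = $link->getAttribute('href');
}

print_r($regexLinks); // contains demo.html and real.html
print_r($domLinks);   // contains only real.html
```

The regex reports `demo.html` as a link even though it is just page content, whereas the DOM parser sees it as text and returns only `real.html`.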
