
I want to extract all the links (the `href` values) from an HTML document. I came across two possible ways. One uses a regular expression:

    $input = urldecode(base64_decode($html_file));
    $regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
    if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            echo $match[2];          // link address
            echo $match[3], "<br>";  // link text
        }
    }

The other creates a DOM document and parses it:

    $html = urldecode(base64_decode($html_file));

    // Create a new DOM document
    $dom = new DOMDocument;

    // Parse the HTML. The @ suppresses the warnings that are raised
    // when the $html string is not well-formed.
    @$dom->loadHTML($html);

    // Get all links. Any other tag name, such as 'img' or 'table',
    // could be used here to extract other elements.
    $links = $dom->getElementsByTagName('a');

    // Iterate over the extracted links and display their text and URLs
    foreach ($links as $link) {
        echo $link->nodeValue;                     // link text
        echo $link->getAttribute('href'), '<br>';  // link address
    }

I don't know which of these is more efficient. The code will be run many times, so I want to know which one is the better choice. Thank you!

Ruslan Osmanov
vikram
  • What about `jsoup`? – Murat Karagöz Jan 11 '17 at 12:04
  • Surely a benchmark test would be the best idea here? That will tell you which is quicker. – Tom Jan 11 '17 at 12:05
  • Undoubtedly, you shouldn't parse HTML with regular expressions! Use DOM or SimpleXML for relatively small documents, and SAX/pull parsers for large documents, e.g. XML parser, XMLReader – Ruslan Osmanov Jan 11 '17 at 12:06
  • I'm voting to close this question as off-topic because the question should be asked on http://codereview.stackexchange.com/ – Ruslan Osmanov Jan 11 '17 at 12:07
  • Required reading: http://stackoverflow.com/a/1732454/ – piet.t Jan 11 '17 at 12:11
  • How would your regex handle a webpage about HTML describing the `href` attribute with examples? That would contain lots of "href=" occurrences that are not links but content. – piet.t Jan 11 '17 at 12:11
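
The two suggestions in the comments (run a benchmark, and check how the regex treats `href=` inside page content) can be tried together with a small script. This is only a sketch: the function names `linksByRegex`, `linksByDom` and `timeIt`, the sample markup, and the run count are all made up for illustration, and `libxml_use_internal_errors()` is used in place of the `@` operator from the question. It assumes PHP 7.4 or later with the DOM extension enabled.

```php
<?php
// Sample page (hypothetical): the first paragraph only *describes* a link
// as escaped text, the second paragraph contains a real <a> element.
$sample = '<p>To link, write <code>&lt;a href="demo.html"&gt;</code>.</p>'
        . '<p><a href="https://example.com/">Example</a></p>';

// Regex approach, using the pattern from the question unchanged.
function linksByRegex(string $html): array
{
    $regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
    // Group 2 of each match holds the captured address.
    return array_map(fn($m) => $m[2], $matches);
}

// DOM approach, collecting parser warnings internally instead of using @.
function linksByDom(string $html): array
{
    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Average wall-clock time per call in milliseconds.
function timeIt(callable $fn, string $html, int $runs = 1000): float
{
    $start = hrtime(true); // nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn($html);
    }
    return (hrtime(true) - $start) / 1e6 / $runs;
}

printf("regex links: %s\n", implode(', ', linksByRegex($sample)));
printf("DOM links:   %s\n", implode(', ', linksByDom($sample)));
printf("regex: %.4f ms per run\n", timeIt('linksByRegex', $sample));
printf("DOM:   %.4f ms per run\n", timeIt('linksByDom', $sample));
```

Comparing the two printed link lists shows whether the escaped example in the first paragraph is picked up as a link. The timings depend heavily on document size and PHP version, so the script should be run against documents that resemble the real input before drawing any conclusion.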
