I want to extract all the href links from an HTML document. I came across two possible approaches. One uses a regex:
$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[2]; // link address
        echo $match[3] . "<br>"; // link text
    }
}
And the other one is creating DOM document and parsing it:
$html = urldecode(base64_decode($html_file));
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ suppresses the warnings that loadHTML()
//emits when the $html string isn't well-formed HTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Show the link text, then the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
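One way to settle this empirically might be a rough timing harness like the sketch below. It assumes a made-up sample document (1000 identical links) in place of the real decoded input, and the helper names linksByRegex, linksByDom, and averageSeconds are hypothetical. The regex is the one from the first snippet; the DOM version uses libxml_use_internal_errors() rather than @ to silence parser warnings, which is a swapped-in choice and not part of the original code.

```php
<?php
// Hypothetical sample input; the real code would use the decoded $html_file.
$html = '<html><body>'
      . str_repeat('<p><a href="http://example.com/page">Example</a></p>', 1000)
      . '</body></html>';

// Approach 1: the regex from the question, returning only the link addresses.
function linksByRegex(string $html): array
{
    $regexp = 'href\s*=\s*("??)([^" >]*?)\1[^>]*>(.*)\s*';
    preg_match_all("/$regexp/siU", $html, $matches);
    return $matches[2]; // capture group 2 holds the href values
}

// Approach 2: DOMDocument, returning the href of every <a> element.
function linksByDom(string $html): array
{
    $dom = new DOMDocument;
    // Collect libxml warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Run a callable several times and return the average seconds per run.
function averageSeconds(callable $fn, string $html, int $runs = 20): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($html);
    }
    return (microtime(true) - $start) / $runs;
}

printf("regex: %.6f s per run\n", averageSeconds('linksByRegex', $html));
printf("DOM:   %.6f s per run\n", averageSeconds('linksByDom', $html));
```

The absolute numbers will depend on the PHP version and on the size and shape of the real documents, so the sample string should be swapped for representative input before drawing any conclusion.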
I don't know which of these is more efficient. The code will be run many times, so I want to know which approach is the better one to go with. Thank you!