Enumerating A HREF items inside a string of HTML

Question

I'm trying to enumerate a list of hyperlinks (specifically the HREF component) from a string of HTML. The contents of each page are not too far off what early versions of Yahoo looked like (a series of hyperlinks broken into groupings by LI and UL tags).

We are parsing a series of previously hand-crafted HTML pages from an old system and want to pull only the meaningful content from each page rather than migrating the entire string. In my testing, my process is straightforward and is as follows:

Load the contents of the HTML page into a string.
Parse the contents looking for "A" elements, but only after a specific tag with a specific class assigned.
For each link found, echo (for testing) the URL (and ultimately write that item to our database).

I'm fairly sure that the best way to do this is with a regular expression, but I wasn't able to get the examples I could find on Stack Overflow working correctly (even to echo out found matches), and I didn't have much success with the DOM parser either.

My test data looks like this:

<html>
<body>
<li><a href='beforelist.com'></a></li>
<ul class="summary">
<li><a href='test.com'></a></li>
<li><a href='test2.com'></a></li>
<li><a href='etc.com'></a></li>
</ul>
<li><a href='afterlist.com'></a></li>
<img src='/test.png'>
</body>
</html>

and I am looking for output that matches the following (only after it finds the class='summary'):

 test.com
 test2.com
 etc.com

Everything outside of the summary grouping is ignored and is very unpredictable as to what it may include. I'm sure I'm missing something obvious and greatly appreciate any assistance! I never really understood how to write regex patterns correctly. :)

You may be able to get away with regular expressions, but this is still a great read with good advice: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — zneak, Jul 31 '14 at 06:46

TiMESPLiNTER · Accepted Answer · 2014-07-31T11:06:24.567

The way to go is with DOMDocument and DOMXPath. Never, ever parse HTML with regex.

Here's a simple example for your case:

// Create new DOM
$dom = new DOMDocument();
// Import your HTML string into DOM
$dom->loadHTML($html);

// Create new XPath which has the above DOM as resource
$xpath = new DOMXPath($dom);

// Find every ul with class summary and select all the "a"s in it
$links = $xpath->query("//ul[@class='summary']//a");

// Loop through the links
foreach($links as $link) {
    // Print out the href attribute
    var_dump($link->getAttribute('href'));
}

The output of this little PHP snippet is:

string 'test.com' (length=8)
string 'test2.com' (length=9)
string 'etc.com' (length=7)

It's really that easy. The XPath query finds every link inside an unordered list whose class attribute is exactly summary, even when the link sits in a nested list.
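One caveat with the query above: `@class='summary'` compares the whole attribute value, so a list marked up as `class="links summary"` would not match. The following is a minimal sketch, not part of the original answer, that matches `summary` as one class among several using the usual XPath 1.0 `concat`/`normalize-space` idiom. It also uses `libxml_use_internal_errors()` so that warnings from hand-crafted markup are collected rather than printed. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup: the first list carries two classes.
$html = <<<'HTML'
<html><body>
<ul class="links summary">
<li><a href="test.com">one</a></li>
<li><a href="test2.com">two</a></li>
</ul>
<ul class="other"><li><a href="skip.com">skip</a></li></ul>
</body></html>
HTML;

// Collect libxml parser warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// Pad the class attribute with spaces and look for " summary " as a whole token.
$query = "//ul[contains(concat(' ', normalize-space(@class), ' '), ' summary ')]//a";

$hrefs = [];
foreach ($xpath->query($query) as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```

If every page in the old system uses exactly `class="summary"`, the simpler `@class='summary'` test is sufficient.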

Thank you so much for this - it worked and finally makes some sense! Much appreciated! — nuge, Jul 31 '14 at 10:53

score 0 · Answer 2 · answered Jul 31 '14 at 06:58

Code with explanation:

<?php
// to retrieve selected html data, try these DomXPath examples:

$html="<html>
<body>
<li><a href='beforelist.com'></a></li>
<ul class='summary'>
<li><a href='test.com'></a></li><li><a href='test2.com'></a></li><li><a href='etc.com'></a></li>
</ul>
<li><a href='afterlist.com'></a></li>
<img src='/test.png'>
</body>
</html>";
$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// example 1: for everything with an id
//$elements = $xpath->query("//*[@id]");

// example 2: for node data in a selected id
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");

// example 3: what you are looking for
$elements = $xpath->query("//ul[@class='summary']//li/a");

// query() returns a DOMNodeList, or false if the expression is invalid
if ($elements !== false) {
  foreach ($elements as $element) {
    echo $element->getAttribute('href') . "\n";
  }
}
?>

Demo here: https://eval.in/173506
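Since the goal in the question is to eventually write each URL to a database, it may help to wrap the lookup in a function that returns an array instead of echoing. This is only a sketch under assumptions: the function name `extractSummaryHrefs` and its signature are hypothetical, and it selects the `href` attribute nodes directly with `//a/@href` rather than calling `getAttribute()` on each element.

```php
<?php
/**
 * Hypothetical helper (name and signature are assumptions, not from the thread).
 * Returns the non-empty href values of all links inside <ul class="summary">.
 */
function extractSummaryHrefs(string $html): array
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Keep parser warnings from hand-written markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];

    // Select the href attribute nodes themselves, not the <a> elements.
    foreach ($xpath->query("//ul[@class='summary']//a/@href") as $attr) {
        $value = trim($attr->nodeValue);
        if ($value !== '') {
            $hrefs[] = $value;
        }
    }

    return $hrefs;
}

// Usage with markup shaped like the test data from the question.
$page = "<html><body>
<li><a href='beforelist.com'></a></li>
<ul class='summary'>
<li><a href='test.com'></a></li>
<li><a href='test2.com'></a></li>
<li><a href='etc.com'></a></li>
</ul>
<li><a href='afterlist.com'></a></li>
</body></html>";

foreach (extractSummaryHrefs($page) as $href) {
    echo $href . "\n";
}
```

The returned array can then be handed to whatever code performs the database insert, which keeps parsing and storage separate.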
