Simple HTML DOM
Simple HTML DOM provides an object-oriented way of accessing the HTML DOM in PHP. I've used it before with a lot of success, but it will choke on a large DOM structure. A nice feature is the ability to manipulate the DOM and save it through the same object-oriented interface. It also lets you search the DOM with selectors:
// Find all <div> elements whose id attribute is foo
$ret = $html->find('div[id=foo]');
or:
// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}
// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);
And it allows for traversal:
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
DOMDocument
As others have noted, you can also use PHP's built-in DOMDocument class.
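As a rough sketch of what that looks like (the sample markup and variable names here are made up for illustration, not taken from the question), DOMDocument can load an HTML string and be queried by tag name or, through DOMXPath, by attribute:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$markup = '<html><body><div id="foo"><ul><li>one</li><li>two</li></ul></div></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is often malformed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

// Collect the text of every <li> in the document.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = $li->textContent;
}

// Find every <div> whose id attribute is "foo".
$xpath = new DOMXPath($doc);
$divs = $xpath->query('//div[@id="foo"]');
```

The libxml_use_internal_errors(true) call is optional; it just keeps messy markup from flooding the output with warnings.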
XPath
From my personal experience, while XPath is harder to get working, it's worth it if you're only interested in extracting info from the DOM.
While it's not exactly the info you're trying to extract, here's how I've used XPath to extract info from an XML document:
The XML:
<?xml version="1.0" encoding="utf-8"?>
<Report>
  <CampaignPerformanceReportColumns>
    <Column name="AccountName" />
    ...
    <Column name="CampaignId" />
  </CampaignPerformanceReportColumns>
  <Table>
    <Row>
      <CampaignName value="Auctions" />
      <GregorianDate value="8/11/2010" />
      ...
      <CampaignId value="60312546" />
    </Row>
    <Row>
      <CampaignName value="Auctions" />
      <GregorianDate value="8/11/2010" />
      ...
      <CampaignId value="60312546" />
    </Row>
    <Row>
      <CampaignName value="Auctions 2" />
      <GregorianDate value="8/11/2010" />
      ...
      <CampaignId value="603125467" />
    </Row>
  </Table>
</Report>
PHP:
$xml = simplexml_load_file($file);
// Get each Row (path is relative to the root <Report> element)
$rows = $xml->xpath("Table/Row");
// Get the CampaignId element of each Row, anywhere in the document
$campaignIds = $xml->xpath("//Row/CampaignId");
XPath has many more features; I'd encourage you to explore it if you need to extract a lot of info from any XML-structured document.
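As one example of those further features, here is a minimal sketch that reads the value attributes out of the matched nodes and uses a predicate to pick out a single row. It runs against a trimmed copy of the report above (the elided columns are simply left out), and the variable names are my own:

```php
<?php
// Trimmed copy of the report; the columns elided with "..." are omitted.
$source = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<Report>
  <Table>
    <Row>
      <CampaignName value="Auctions" />
      <CampaignId value="60312546" />
    </Row>
    <Row>
      <CampaignName value="Auctions 2" />
      <CampaignId value="603125467" />
    </Row>
  </Table>
</Report>
XML;

$xml = simplexml_load_string($source);

// Read the value attribute of every CampaignId element.
$ids = [];
foreach ($xml->xpath('//Row/CampaignId') as $node) {
    $ids[] = (string) $node['value'];
}

// Predicate: keep only rows whose CampaignName value is "Auctions 2".
$matches = $xml->xpath('//Row[CampaignName/@value="Auctions 2"]');
$matchedId = (string) $matches[0]->CampaignId['value'];
```

The (string) casts matter: SimpleXML hands back SimpleXMLElement objects, not plain strings, so cast before comparing or storing the values.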