I'm trying to write a robot that will be fetching html parsing it daily. Now for parsing html i could use just string functions like explode, or regural expressions, but I found the dom xpath code much cleaner, so now I can make a configuration of all the sites I have to spider and tags I have to strip out like:
'http://examplesite.com' => '//div/a[@class="articleDesc"]/@href'
So the code looks like this
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//body/div[@class="articleDesc"]');
foreach ($tags as $tag)
echo $tag->nodeValue . "\n";
So with this I get all the div tags with class article description, which i great. But I noticed that all the html tags inside the div tag are stripped out. I wonder how would I get the whole contents of that div I'm looking at.
I also find it hard to see any proper documentation for $xpath->query() to see how to form the string. The php site doesn't tell much about the exact formation of it. Still, my main problem i