Possible Duplicate:
PHP HTML DomDocument getElementById problems
I'm trying to extract information from Google search results in PHP. I can fetch the search URLs without a problem, but getting anything out of the responses is a different matter. After reading numerous posts and the applicable PHP docs, I came up with the following:
// get large panoramas of montana
$url = 'http://www.google.com/search?q=montana+panorama&tbm=isch&biw=1408&bih=409';
$html = file_get_contents($url);
// was getting tons of "entity parse" errors, so added
$html = htmlentities($html, ENT_COMPAT, 'UTF-8', true); // tried false as well
$doc = new DOMDocument();
//$doc->strictErrorChecking = false; // tried both true and false here, same result
$result = $doc->loadHTML($html);
//echo $doc->saveHTML(); this shows that the tags I'm looking for are in fact in $doc
if ($result === true)
{
    var_dump($result); // prints 'true'
    $tags = $doc->getElementById('center_col');
    $tags = $doc->getElementsByTagName('td');
    var_dump($tags); // previous 2 lines both print NULL
}
I've verified that the ids and tags I'm looking for are in the HTML with error_log($html), and in the parsed document with $doc->saveHTML(). Can anyone see what I'm doing wrong?
Edit:
Thanks all for the help, but I've hit a wall with DOMDocument. Nothing in any of the docs, or other threads, works with Google image queries. Here's what I tried:
I looked at the link from @Jon and tried all the suggestions there, then looked at the getElementById docs and read all the comments there as well. I'm still getting empty result sets, which is better than NULL, but not by much.
I tried the xpath trick:
$xpath = new DOMXPath($doc);
$ccol = $xpath->query("//*[@id='center_col']");
Same result, an empty set.
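To isolate this, here is a minimal sketch that runs the same XPath query against a small inline document instead of the Google response. The markup and the countMatches() helper are hypothetical stand-ins, not the real page. It compares the raw string with the htmlentities()-escaped one, since escaping turns every tag into literal text before loadHTML() sees it.

```php
<?php
// Hypothetical stand-in for the Google response; not the real markup.
$html = '<!DOCTYPE html><html><body>'
      . '<div id="center_col"><table><tr><td>x</td></tr></table></div>'
      . '</body></html>';

// Hypothetical helper: parse $markup and count elements with id="center_col".
function countMatches($markup)
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    return $xpath->query("//*[@id='center_col']")->length;
}

// Same query, once on the raw markup and once on the escaped markup.
$raw     = countMatches($html);
$escaped = countMatches(htmlentities($html, ENT_COMPAT, 'UTF-8', true));

echo "raw: $raw, escaped: $escaped\n";
```

If the escaped count comes back as 0 while the raw count is 1, the htmlentities() call is a likely cause of the empty sets, and silencing the parse warnings with libxml_use_internal_errors(true) may be a better fit than escaping the input.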
I did an error_log($html) directly after the file read, and the document has a doctype "", so it's not that.
I also see there that user "carl2088" says "From my experience, getElementById seem to work fine without any setups if you have loaded a HTML document". Not in the case of Google image queries, it would appear.
In desperation, I tried
echo count(explode('center_col', $html));
to see if for some strange reason it disappears after the initial error_log($html). It's definitely there: the string is split into 4 chunks, which means 3 occurrences.
I checked my version of PHP (5.3.15, compiled Aug. 25 2012), so it's not too old to support getElementById.
Before yesterday, I had been using an extremely ugly series of "explodes" to get the info, and while it's horrid code, it took 45 minutes to write and it works.
I'd really like to ditch my "explode" hack, but spending 5 hours to achieve nothing versus 45 minutes to get something that works makes it really difficult to do things the right way.
If anyone else with experience using DOMDocument has some additional tricks I could try, it would be much appreciated.