Why am I not getting back any images here?

Question

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$html = @file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xml = @simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

Output is:

array(0) { }

However, in the page source I see this:

<img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" />

Edit: It appears $html's contents stop at the <body> tag for this page. Any idea why?

What happens, if you remove the `@` in front of `file_get_contents()`? (actually, if you remove any `@` in that code) — Boldewyn, Apr 19 '11 at 16:08
Yes, I'd remove those `@` signs. Hopefully, you'll see some errors. — Blender, Apr 19 '11 at 16:10
Perhaps the page is looking for some specific headers to be set in the request to prevent bots from grabbing the content. Try using `curl` instead and set the same headers as your browser. Use fiddler2 on Windows to see the browsers headers and something like Paros on Linux. — Treffynnon, Apr 19 '11 at 16:14
var_dump is not going to show you any information about DOM* objects because of this http://stackoverflow.com/questions/4776093/why-var-dump-cant-print-domdocument-object-only-with-printdom-savehtml-its/4776208#4776208 — akond, Apr 26 '11 at 14:33
Is this `` the value of `array(0){}`, if that is the case, image would not display... try dumping `var_dump($images[0])` — J Bourne, May 02 '11 at 09:30
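
The comments above suggest dropping the `@` operators so that errors become visible. Here is a minimal sketch of one way to do that for the parsing step, assuming libxml's internal error buffer is acceptable for your setup. `load_html_with_errors` is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper: parse an HTML string and return the parser's
// messages instead of silencing them with the @ operator.
function load_html_with_errors(string $html): array
{
    // Collect libxml errors in a buffer rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Reset the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$ok ? $doc : null, $messages];
}
```

Calling `[$doc, $messages] = load_html_with_errors($html);` and printing `$messages` shows what the parser complained about. Checking whether `file_get_contents()` returned `false` before parsing is a separate step that this sketch does not cover.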

Harmon Wood · Accepted Answer · 2011-04-26T21:33:08.070

It appears $html's contents stop at the <body> tag for this page. Any idea why?

Yes, you must provide this page with a valid user agent.

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_exec($ch);

This outputs everything up to the closing </html>, including the <img border="0" width="336" height="69" src="/images/w3schoolslogo.gif" alt="W3Schools.com" style="margin-top:5px;" /> you were looking for.

A plain wget or curl request without a user agent, by contrast, returns the page only up to the <body> tag. Combined with the original parsing code:

$url = 'http://www.w3schools.com/js/js_loop_for.asp';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');

var_dump($images);
die();

EDIT: My first post stated that there was still an issue with xpath. I had not done my due diligence; the updated code above works. I forgot to make curl return the response as a string rather than print it to the screen (as it does by default).
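
If you would rather keep `file_get_contents()` from the question, the same user agent idea can be sketched with a stream context. This is an illustration under assumptions, not something tested against the page in question: `make_ua_context` and `fetch_html` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header.
function make_ua_context(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,      // sent as the User-Agent request header
            'timeout'    => $timeoutSeconds, // assumed value, adjust as needed
        ],
    ]);
}

// Hypothetical helper: fetch a URL with the given user agent, without
// suppressing errors via @.
function fetch_html(string $url, string $userAgent): string
{
    $html = file_get_contents($url, false, make_ua_context($userAgent));
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}
```

Usage would look like `$html = fetch_html($url, 'MozillaXYZ/1.0');`, after which the DOMDocument code above applies unchanged.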

score 0 · Answer 2 · answered Apr 19 '11 at 17:12


Why bring simplexml into the mix? You're already loading the HTML from w3fools into the DOM class, which has a perfectly good XPath query engine in it already.

[...snip...]
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$images = $xpath->query('//img'); // DOMXPath exposes query(), not xpath()
[...snip...]
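
For illustration, here is a self-contained sketch of the DOMXPath approach that runs on an HTML string without any network access. `extract_image_sources` is a hypothetical helper name, and silencing parser warnings through libxml's internal error buffer is an assumption about what you want.

```php
<?php
// Hypothetical helper: return the src attribute of every <img> in an HTML string.
function extract_image_sources(string $html): array
{
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Query the DOM directly; no SimpleXML conversion is involved.
    $xpath = new DOMXPath($doc);
    $sources = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}
```

If the fetched HTML is cut off at the <body> tag, as in the question, this returns an empty array, so an empty result points at the fetch step and not the XPath expression.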


Marc B


I'm using xpath to simplify things here. I don't want to change all the working code where $xpath is used later. I'm trying to find out why it stops at the <body> tag – barfoon Apr 19 '11 at 18:07

score -1 · Answer 3 · edited Jun 20 '20 at 09:12


The IMG tag is generated by javascript. If you'd downloaded this page via wget, you'd realize there is no IMG tag in the HTML.

Update #1

I believe it is because of the user agent string. If I supply "Mozilla/5.0 (X11; Linux i686 on x86_64; rv:2.0) Gecko/20100101 Firefox/4.0" as the user agent, I get the whole page.


answered Apr 26 '11 at 14:44

akond


If I view the source of the page, it looks like the gif mentioned in the post isn't generated by JavaScript. Also, I'm not using wget – barfoon Apr 26 '11 at 15:19
If I go to the page and hit view source, I get everything, including images that don't appear to be generated by scripts. If I request it programmatically (which is what iwebtool is doing), the result is cut off at the <body> tag, which is what I'm asking for an explanation of in my question here. – barfoon Apr 26 '11 at 20:16
