
I am loading a number of RSS feeds with DOMDocument, and sometimes a feed URL returns a 404 instead of the file. The web server sends an HTML 404 page in place of the expected XML file, so with this code:

$rssDom = new DOMDocument();
$rssDom->load($url);
$channel = $rssDom->getElementsByTagName('channel');
$channel = $channel->item(0);
$items = $channel->getElementsByTagName('item');

I get this warning:

Warning: DOMDocument::load() [domdocument.load]: Entity 'nbsp' not defined

Followed by this error:

Fatal error: Call to a member function getElementsByTagName() on a non-object

Normally, this code works fine, but on the occasion that I get a 404 it fails to do anything. I tried a standard try-catch around the load statement but it doesn't seem to catch it.

Salman A
fishpen0
  • If "Entity nbsp; is not defined", perhaps the 404 returned an XML (not HTML) source? (`&nbsp;` is not defined in XML.) –  May 01 '12 at 09:30

5 Answers


You can suppress the output of parsing errors with

libxml_use_internal_errors(true);

To check whether the returned response is a 404 you can check the $http_response_header after the call to DOMDocument::load()

Example:

libxml_use_internal_errors(true);
$rssDom = new DOMDocument();
$rssDom->load($url);
if (strpos($http_response_header[0], '404') !== false) {
    die('file not found. exiting.');
}

The alternative would be to use file_get_contents, check the response header, and, if it's not a 404, load the markup with DOMDocument::loadXML. This would prevent DOMDocument from parsing invalid XML.

Note that all this assumes that the server correctly returns a 404 header in the response.
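
A minimal sketch of the file_get_contents alternative follows. The helper names lastStatusCode and loadFeed are hypothetical and not part of PHP or the original answer. The sketch takes the status code from the last status line in $http_response_header, because after a redirect the array also contains the headers of the earlier responses, and it treats anything other than 200 as a failure, which is an assumption you may want to relax.

```php
<?php
// Hypothetical helper: returns the status code of the final response
// (after any redirects) from an $http_response_header-style array, or null.
function lastStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 404 Not Found".
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch $url and parse it only on a 200 response.
function loadFeed(string $url): ?DOMDocument
{
    // ignore_errors makes file_get_contents return the body even for 4xx/5xx
    // responses instead of emitting a warning and returning false.
    $context = stream_context_create(['http' => ['ignore_errors' => true]]);
    $body = @file_get_contents($url, false, $context);
    if ($body === false || $body === '') {
        return null; // connection failure or empty response
    }
    if (lastStatusCode($http_response_header ?? []) !== 200) {
        return null; // e.g. the HTML 404 page: never hand it to the XML parser
    }
    $previous = libxml_use_internal_errors(true); // keep parse errors out of the output
    $dom = new DOMDocument();
    $ok = $dom->loadXML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $dom : null;
}
```

A caller would then check for null before touching the document, for example `if (($rssDom = loadFeed($url)) !== null) { ... }`.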

Gordon
  • +1, although I think I preferred it when I didn't know about `$http_response_headers` at all. PHP, how low can you go? – Jon May 01 '12 at 09:37
  • @Jon yeah, I'd prefer to have a function like `get_last_response_headers()` instead of `$http_response_headers` magically populating after an http call. it's so unobvious. – Gordon May 01 '12 at 09:43
  • Typo? `$http_response_headers` is not defined, while `$http_response_header` is defined though. The doc page [`$http_response_header`](http://php.net/manual/en/reserved.variables.httpresponseheader.php) uses the singular `header` as well. – Lionel Chan Nov 28 '12 at 07:40

Load the feed manually with file_get_contents or curl (which allows you to do your own error checks) and, if all goes well, feed the result to DOMDocument::loadXML.

There are lots of curl examples here (e.g. look at this one, although it's surely not the best); to get the HTTP status code you would use curl_getinfo.
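
As a rough sketch of the curl route, the code below fetches the URL, reads the status with curl_getinfo and CURLINFO_HTTP_CODE, and only parses the body when the status is 200. The helper names fetchWithCurl and parseFeed, the 10-second timeout, and the choice to follow redirects are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns [HTTP status code, body].
// A status of 0 means the transfer itself failed.
function fetchWithCurl(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout, chosen for illustration
    $body = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$body === false ? 0 : $status, (string) $body];
}

// Hypothetical helper: parse only when the status is 200 and the body is well-formed XML.
function parseFeed(int $status, string $body): ?DOMDocument
{
    if ($status !== 200 || $body === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true); // keep parse errors out of the output
    $dom = new DOMDocument();
    $ok = $dom->loadXML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok ? $dom : null;
}
```

Usage would be along the lines of `[$status, $body] = fetchWithCurl($url); $rssDom = parseFeed($status, $body);`, followed by a null check before calling getElementsByTagName.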

Community
Jon
  • I used [this answer](http://stackoverflow.com/a/4358138/2257664) to check the HTTP status code. – A.L Dec 04 '14 at 15:30

To avoid the warning, you could use LIBXML_NOWARNING (note: suppressing warnings normally isn't a good thing to do).

The more important problem here is the fatal error. To avoid it, check whether the document has been loaded correctly: save the return value of load() and use it:

$rssDom = new DOMDocument();
$loaded = $rssDom->load($url, LIBXML_NOWARNING);
if($loaded){
    $channel = $rssDom->getElementsByTagName('channel');
    $channel = $channel->item(0);
    $items = $channel->getElementsByTagName('item');
}else{
    // show error-message or something like that
}
oezi

Like this:

$rssDom = new DOMDocument();
if($rssDom->load($url)) {
   $channel = $rssDom->getElementsByTagName('channel');
   $channel = $channel->item(0);
   $items = $channel->getElementsByTagName('item');
}
Bjørne Malmanger

In case someone needs a solution, this works like a charm:

$objDOM = new DOMDocument();
$loaded = @$objDOM->load($url);

if (!$loaded){
    //something went terribly wrong
} else {
    //this is going ok!!
}

This works because '@' suppresses the warnings and load() returns true on success and false on failure.
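
If you would rather see why the load failed than silence it with '@', one possible variation is to collect the parser errors through libxml_use_internal_errors and libxml_get_errors. The helper name parseXmlWithErrors is hypothetical, and the sketch parses a string with loadXML for simplicity; the same pattern should apply to load($url).

```php
<?php
// Hypothetical helper: parse an XML string and return
// [DOMDocument or null, list of parser error messages].
function parseXmlWithErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();                        // start from an empty error buffer
    $dom = new DOMDocument();
    $ok = $xml !== '' && $dom->loadXML($xml);
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each $error is a LibXMLError with public line and message properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting
    return [$ok ? $dom : null, $messages];
}
```

The returned messages can be logged, so a feed that starts serving an HTML error page shows up in the logs instead of failing silently.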

Reinherd