14

I'm trying to get the "link" elements from certain webpages, but I can't figure out what I'm doing wrong. I'm getting the following error:

Severity: Warning

Message: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 536

Filename: controllers/test.php

Line Number: 34

Line 34 of the code is:

      $dom->loadHTML($html);

My code:

    $url = "http://www.amazon.com/";

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    if($html = curl_exec($ch)){

        // parse the html into a DOMDocument
        $dom = new DOMDocument();

        $dom->recover = true;
        $dom->strictErrorChecking = false;

        $dom->loadHTML($html);

        $hrefs = $dom->getElementsByTagName('a');

        echo "<pre>";
        print_r($hrefs);
        echo "</pre>";

        curl_close($ch);


    }else{
        echo "The website could not be reached.";
    }
David

3 Answers

42

It means some of the HTML is invalid. This is just a warning, not an error, so your script will still process the document. To stop libxml from emitting the warnings, call this before loadHTML():

 libxml_use_internal_errors(true);

Or you could suppress the warning on that one call with the @ operator:

@$dom->loadHTML($html);
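
As a minimal sketch of the first option, assuming a small hypothetical HTML string in place of the fetched page, the buffered errors can still be read back with libxml_get_errors(), so nothing is silently lost:

```php
<?php
// Sketch: collect libxml parse errors instead of emitting PHP warnings.
// The sample markup is hypothetical; it contains a bare "&" like the one
// behind "htmlParseEntityRef: no name in Entity".
$html = '<html><body><a href="/a?x=1&y=2">Tom & Jerry</a></body></html>';

// Buffer parse errors internally and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Read what the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the previous setting

foreach ($errors as $error) {
    // Each entry is a LibXMLError with level, code, message, line and column.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

$links = $dom->getElementsByTagName('a');
```

Which problems get reported depends on the libxml2 version PHP was built against, so the loop may print nothing on some installations even though the document loads.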
Kris
  • Are you sure you set libxml_use_internal_errors(true); at the top of the php script? I also updated my answer to provide another alternative – Kris Sep 08 '12 at 05:53
  • that hides the warning, but it's returning an empty object – David Sep 08 '12 at 05:57
  • That is weird. I ran your exact code and it worked fine. It returned a bunch of objects. Your print_r statement outputted DOMNodeList Object ( [length] => 81 ) – Kris Sep 08 '12 at 06:02
  • -1 For suggesting suppression of all errors on that line. This will lead to a debugging nightmare. I would have given you a +1 if it were not for that. – Gerry Aug 01 '13 at 02:28
  • @Gerry, Kris is wrong about one thing: the script will not process the things you wanted (it will skip them). The @ operator not only hides messages, it also forces the script to proceed anyway. This can be a little dangerous, BUT when loading badly broken HTML it can be the most powerful tool :) – jave.web Aug 06 '13 at 03:14
  • This is an awful solution, NEVER do this. If you want to suppress error output to the browser, you could do something like ob_start(); ...commands here... $buf = ob_get_clean(); and then check $buf for any error output. That lets you keep the errors but stops the browser output. – Christopher Thomas Jan 03 '14 at 18:01
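
On the "empty object" mentioned in the comments: print_r() on a DOMNodeList shows little more than the list's length, so it can look empty even when it holds nodes. A minimal sketch of reading the nodes instead, using hypothetical markup as a stand-in for the fetched page:

```php
<?php
// Sketch: read attributes from each node instead of print_r() on the list.
// The markup below is a hypothetical stand-in for the fetched page.
$html = '<html><head><link rel="stylesheet" href="/site.css"></head>'
      . '<body><a href="/first">First</a><a href="/second">Second</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect the href of every <a> element.
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // getAttribute() returns an empty string when the attribute is absent.
    $hrefs[] = $anchor->getAttribute('href');
}

// <link> elements are a different tag from <a> and need their own lookup.
$stylesheets = [];
foreach ($dom->getElementsByTagName('link') as $link) {
    $stylesheets[] = $link->getAttribute('href');
}

print_r($hrefs);
print_r($stylesheets);
```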
15

This may be caused by a stray & that is immediately followed by a proper tag rather than by an entity name. Otherwise (an entity name with no closing ;) you would receive a "missing ;" error instead. See: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity.

The solution is to replace the & symbol with &amp;
or, if you must keep that & as it is, maybe you could enclose it in: <![CDATA[ - ]]>
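
One way to apply the &amp; replacement before parsing is a regular expression that skips ampersands already starting an entity. This is only a sketch: escapeBareAmpersands() is a hypothetical helper, the sample markup is made up, and the pattern does not treat script or style contents specially.

```php
<?php
// Hypothetical helper: escape "&" characters that do not start an entity.
function escapeBareAmpersands(string $html): string
{
    // The negative lookahead leaves named (&amp;), decimal (&#38;) and
    // hexadecimal (&#x26;) references untouched.
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/';
    return preg_replace($pattern, '&amp;', $html);
}

$fixed = escapeBareAmpersands('<td>7 & 8</td><td>Fish &amp; chips &#38; peas</td>');
echo $fixed, "\n";
// prints: <td>7 &amp; 8</td><td>Fish &amp; chips &#38; peas</td>
```

The result can then be passed to loadHTML() in place of the raw string.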

Ujjwal Singh
  • In my case I outputted a variable containing an ampersand between `` tags i.e. `$variable['ingredient'] = "7 & 8"; $tbody .= "" . $variable['ingredient'] . "";` which led to this error: `Message: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity`. – Hmerman6006 May 03 '21 at 09:50
2

The HTML is poorly formed. If it is malformed badly enough, loading it into the DOMDocument might fail altogether, and if loadHTML is not working then suppressing the errors is pointless. If you are unable to load the HTML into the DOM, I suggest using a tool like HTML Tidy to "clean up" the markup first.

HTML Tidy can be found at http://www.htacg.org/tidy-html5/
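
A rough sketch of that clean-up step from PHP, assuming the optional tidy extension (ext-tidy) is installed; using the extension rather than the standalone tool is an assumption here, and the sample markup is hypothetical:

```php
<?php
// Sketch: repair markup with PHP's tidy extension before parsing it.
// Assumes ext-tidy is available; the sample markup is hypothetical.
$html = '<html><body><p>Unclosed paragraph<div><a href="/x">link</a></body></html>';

if (function_exists('tidy_repair_string')) {
    // 'output-html' asks Tidy for HTML rather than XHTML output.
    $repaired = tidy_repair_string($html, ['output-html' => true], 'utf8');
    // Keep the original markup if Tidy could not produce anything usable.
    if (is_string($repaired) && $repaired !== '') {
        $html = $repaired;
    }
}

// Buffer any remaining parse errors so they do not surface as warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$count = $dom->getElementsByTagName('a')->length;
echo $count, "\n";
```

When the extension is missing, the sketch falls back to parsing the original markup unchanged.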

DeltaLee