26

I have found several similar questions, but so far, none have been able to help me.

I am trying to output the 'src' of all images in a block of HTML, so I'm using DOMDocument(). This method is actually working, but I'm getting a warning on some pages, and I can't figure out why. Some posts suggested suppressing the warning, but I'd much rather find out why the warning is being generated.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 10

One example of post->post_content that is generating the error is -

On Wednesday 21st November specialist rights of way solicitor Jonathan Cheal of Dyne Drewett will be speaking at the Annual Briefing for Rural Practice Surveyors and Agricultural Valuers in Petersfield.
<br>
Jonathan is one of many speakers during the day and he is specifically addressing issues of public rights of way and village greens.
<br>
Other speakers include:-
<br>
<ul>
<li>James Atrrill, Chairman of the Agricultural Valuers Associates of Hants, Wilts and Dorset;</li>
<li>Martin Lowry, Chairman of the RICS Countryside Policies Panel;</li>
<li>Angus Burnett, Director at Martin & Company;</li>
<li>Esther Smith, Partner at Thomas Eggar;</li>
<li>Jeremy Barrell, Barrell Tree Consultancy;</li>
<li>Robin Satow, Chairman of the RICS Surrey Local Association;</li>
<li>James Cooper, Stnsted Oark Foundation;</li>
<li>Fenella Collins, Head of Planning at the CLA; and</li>
<li>Tom Bodley, Partner at Batcheller Monkhouse</li>
</ul>

I can post some more examples of what post->post_content contains if that would be helpful?

I have allowed access to a development site temporarily, so you can see some examples [Note - links no longer accessible as question has been answered] -

Any tips on how to resolve this? Thanks.

$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $post->post_content)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;
David Gard
  • 1
    Showing the line that caused the error would definitely make debugging it easier. – lonesomeday Feb 01 '13 at 14:27
  • ??? The warning is on `DOMDocument::loadHTML();`, so the line causing the error is `$dom->loadHTML(apply_filters('the_content', $post->post_content));` – David Gard Feb 01 '13 at 14:29
  • 1
    Line 10 of the content you're parsing... – lonesomeday Feb 01 '13 at 14:40
  • Ok, with you. In one case, it's `James Cooper, Stnsted Oark Foundation;`. I did think it could be the `;` causing the issue, but removing them all (there were several before) didn't help. – David Gard Feb 01 '13 at 14:43
  • "I can post some example of what post->post_content contains if that would be helpful?". Yeah definitely! Not an example though, I want the exact HTML that is generating the error. – Shoe Feb 01 '13 at 14:44
  • Have updated for you. Thanks. – David Gard Feb 01 '13 at 14:48
  • 13
    @DavidGard My best guess then is that there is an unescaped ampersand (`&`) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. `&copy;`). When it gets to `;`, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text. – lonesomeday Feb 01 '13 at 14:49
  • Ok, that makes sense. And `&` is on line 10 from the looks of it. Will do some testing to fix and see what occurs... Thanks. – David Gard Feb 01 '13 at 14:53
  • Beautiful, that was indeed the problem. I will accept as soon as you post as an answer. Thanks for the help. – David Gard Feb 01 '13 at 14:56
  • might want to phrase the question in the form of a question. better jeopardy payback. – adam rowe May 04 '17 at 14:05

9 Answers

43

This correct answer comes from a comment from @lonesomeday.

My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text.
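The explanation above can be checked with a small sketch. It is only an illustration: the helper name `collect_parse_errors` and the two sample strings are made up for this example (modelled on the list item from the question), and the exact messages reported depend on the libxml2 version PHP was built against, so no particular output is assumed here.

```php
<?php
// Returns the libxml messages raised while DOMDocument parses $html.
// (Hypothetical helper, not part of the original question.)
function collect_parse_errors(string $html): array
{
    // Buffer parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $messages;
}

// Assumed sample input: the same list item with a bare and an escaped ampersand.
$raw     = '<ul><li>Angus Burnett, Director at Martin & Company;</li></ul>';
$escaped = '<ul><li>Angus Burnett, Director at Martin &amp; Company;</li></ul>';

$rawErrors     = collect_parse_errors($raw);
$escapedErrors = collect_parse_errors($escaped);

print_r($rawErrors);     // any messages caused by the bare "&"
print_r($escapedErrors); // messages for the escaped version
```

Comparing the two arrays should show whether the bare ampersand is what the parser is complaining about on a given installation.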

David Gard
  • 24
    So how do I fix it? I cant call htmlentities on whole html string. – MavWolverine Oct 09 '13 at 02:05
  • 8
    @MavWolverine I know this is many years later, but I just stumbled into this same issue. The simplest option I found was just to do a string replace `str_replace(' & ', ' &amp; ', $string)` as `htmlentities` and `htmlspecialchars` caused the `<` and `>` of the HTML tags to be converted. Now I am 100% sure there is a better way to do this, but that sorted what I needed on a simple one off parse job. – PanPipes Feb 06 '20 at 10:22
  • 4
    @PanPipes a little more restrictive: `preg_replace("/&(?!\S+;)/", "&amp;", $string)`. – kagmole Nov 18 '20 at 09:52
  • 2
    This saved my day. I was struggling and later found that content generated by a user included & in a name, and that was the source of all the errors. Thanks – Ephra Aug 04 '21 at 19:19
25

As mentioned here

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

you can use:

libxml_use_internal_errors(true);

see http://php.net/manual/en/function.libxml-use-internal-errors.php
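As a rough sketch of how that call could fit into the image extraction from the question: the function name `extract_image_sources` and the sample content are assumptions made for illustration, not code from the original post. Note that this hides the warning rather than fixing the markup, so the bare ampersand is still in the source HTML.

```php
<?php
// Collects the src attribute of every <img> in an HTML fragment while
// keeping libxml parse errors out of PHP's warning output.
// (Hypothetical helper for illustration.)
function extract_image_sources(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse; loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();                 // discard the buffered errors
    libxml_use_internal_errors($previous); // restore the previous setting

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $sources[] = $img->getAttribute('src');
    }

    return $sources;
}

// Assumed sample content with a bare ampersand and two images.
$content = '<p>Martin & Company</p><img src="/a.png"><p><img src="/b.jpg" alt=""></p>';
print_r(extract_image_sources($content));
```

Restoring the previous setting afterwards keeps the change local, so other code that relies on libxml warnings is not affected.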

Ka.
2

Check for an unescaped "&" character anywhere in your HTML code. I had this issue because of exactly that.

Dhana
2

There is an unescaped "&" somewhere in the HTML; replace "&" with "&amp;". Here is my solution!

 $html = preg_replace('/&(?!amp)/', '&amp;', $html);

It will replace a single ampersand with "&amp;", while any existing "&amp;" stays the same.
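One caveat with the pattern above: it only protects `&amp;`, so other references such as `&lt;` would become `&amp;lt;`. A slightly stricter variant, offered only as a sketch, skips anything that already looks like a named, decimal, or hexadecimal character reference. The function name `escape_bare_ampersands` and the sample string are made up for this example.

```php
<?php
// Escapes "&" unless it already starts a named (&amp;), decimal (&#169;)
// or hexadecimal (&#xA9;) character reference.
// (Hypothetical helper; it does not check that a named reference is a real one.)
function escape_bare_ampersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/',
        '&amp;',
        $html
    );
}

// Assumed sample input mixing bare ampersands with existing references.
$sample = 'Martin & Company &lt;b&gt; &amp; &#169; AT&T';
echo escape_bare_ampersands($sample), "\n";
// prints: Martin &amp; Company &lt;b&gt; &amp; &#169; AT&amp;T
```

The pattern still has a blind spot: text such as `Marks&Spencer;` looks like a reference and is left alone, so it is a pragmatic clean-up step rather than a full HTML-aware fix.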

Nay
1

I don't have the reputation required to leave a comment above, but using htmlspecialchars solved this problem in my case:

$inputHTML = htmlspecialchars($post->post_content);
$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $inputHTML)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;

For my purposes, I'm also using strip_tags($inputHTML, "<strong><em><br>"), so all image tags are stripped out as well - I'm not sure if this would be a problem otherwise.
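For readers who do need the images, it may help to see what `htmlspecialchars` does to the markup before it reaches the parser. The sketch below is an illustration only; `count_images` and the sample string are hypothetical names, not part of the answer.

```php
<?php
// Counts the <img> elements DOMDocument finds in $html.
// (Hypothetical helper for illustration.)
function count_images(string $html): int
{
    $previous = libxml_use_internal_errors(true); // keep warnings out of the output

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->getElementsByTagName('img')->length;
}

// Assumed sample content: a bare ampersand plus one image.
$content = '<p>Martin & Company <img src="/logo.png"></p>';

echo count_images($content), "\n";                   // tags intact: 1
echo count_images(htmlspecialchars($content)), "\n"; // tags escaped to text: 0
```

Because `htmlspecialchars` also escapes `<` and `>`, the parser sees plain text instead of tags, which is why this approach suits cases where the tags are being discarded anyway.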

Good Idea
0

I eventually solved this problem the right way, using tidy

// Configuration
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'wrap'           => 200);

// Tidy to avoid errors during load html
$tidy = new tidy;
$tidy->parseString($bill->bill_text, $config, 'utf8');
$tidy->cleanRepair();

$domDocument = new DOMDocument();
$domDocument->loadHTML(mb_convert_encoding($tidy, 'HTML-ENTITIES', 'UTF-8'));
0

For Laravel,

Use {{ }} instead of {!! !!}

I faced this and managed to solve it.

Apit John Ismail
0

I found there was an error in my table tags. There was an extra </td> that I removed and bingo.

user3251285
-8

Just replace "&" with "and" in your string. Do that for all the other symbols.

Mike
  • 1
    No, that's a terrible suggestion. The use of `&` is for a specific purpose, and simply replacing it with `and` doesn't conform in most cases. Company names are one obvious example. – David Gard Feb 06 '14 at 08:59