How to parse Simple HTML DOM with ampersand (etc) character errors

Question

There are four or five questions on SO that address this specific issue (an example); however, they are quite old (10+ years), and none of them addresses the issue with specifics. I'm hoping that answers to this question might address my specific issue while also clearing up the confusion for the community.

I am trying to parse a client's site, to build a summary of current content for their IT department. (Please don't ask me why they can't do this themselves.)

In the past I have used the PHP library Simple HTML DOM Parser to do tasks such as this. I have not used this library for about seven years, but I've never run into this issue before.

When loading the document to an object using…

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTMLFile($url); // run WITH error output

…PHP returns warnings along this line:

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in https://thehtmlfilename.html, line: 45 in /myScript/index.php on line 47

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: no name in https://thehtmlfilename.html, line: 88 in /myScript/index.php on line 47

These warnings neither seem to prevent the loading of the DOM, nor do they stop the script from running. However, when I attempt to access the href group using $anchors = $dom->getElementsByTagName('a');, the script will run through the first three or four (well-constructed) hrefs, then meet a line like these:

<li class="">
    <a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a>
</li>
<li class="">
    <a href="https://www.thecompany.com/parents-and-families.html">Family & Friends</a>
</li>

Careful analysis determines that it is lines like these that produce the warnings above. Both of these lines produce the "expecting ';'" warning.
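
To check which input lines trigger the parser messages, the libxml errors can be collected instead of being emitted as PHP warnings. The following is a minimal sketch, assuming the markup is already available as a string; the sample `$html` is a hypothetical stand-in for the real page, and whether any messages are reported at all depends on the installed libxml2 version.

```php
<?php
// Hypothetical sample markup with an unescaped ampersand in the link text.
$html = '<ul><li class=""><a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a></li></ul>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each LibXMLError carries the message plus the line and column in the input.
    $messages[] = sprintf('line %d, column %d: %s', $error->line, $error->column, trim($error->message));
}

// Reset libxml's error buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print implode(PHP_EOL, $messages) . PHP_EOL;
```

The parser recovers from these errors, so the document is still built and can be queried afterwards.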

When I var_dump the $anchors object, all that is returned is this:

object(DOMNodeList)#2 (1) {
  ["length"]=>
  int(90)
}
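
The var_dump output reports only the length, but a DOMNodeList can be traversed with foreach to reach each element. A minimal sketch, assuming the two list items quoted above as hypothetical input (the `$links` array is an illustrative name):

```php
<?php
// Hypothetical input: the two list items from the question, unescaped ampersands included.
$html = '<ul>'
    . '<li class=""><a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a></li>'
    . '<li class=""><a href="https://www.thecompany.com/parents-and-families.html">Family & Friends</a></li>'
    . '</ul>';

// Keep parser messages out of the output for this illustration.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Each item is a DOMElement: read the href attribute and the link text.
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        'text' => trim($anchor->textContent),
    ];
}

print_r($links);
```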

Other answers, such as the linked question above, mention:

My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text.

Which suggests that I am on the right track.

Various resolutions that have been suggested all prescribe changing the & to a non-& character by various means: str_replace, preg_replace, htmlentities, &tc.

I see a contradiction in these answers. The & character seems to be interrupting the loading process that is initiated by loadHTMLFile() and which creates the DOM object. If that is the case, the programmer has no opportunity to replace the & character prior to processing.

How then? It's a great step forward to identify the problem, as in the linked questions; but how do we solve that problem? How do we pull these href links from this page?

It's worth noting that the ampersand that we find in…

<a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a>

… is not in the href itself, but in the link text (between the <a> tags).

Could you load the html as a simple string, replace them there, then load via `loadHtml()`? — msmahon, Dec 06 '22 at 16:14
@msmahon Do you mean something like ```file_get_contents()```, then Simple HTML DOM's ```$html->load()``` on that string? — Parapluie, Dec 06 '22 at 16:22
Kind of. `$html = file_get_contents(...)` then perform a replace on the string (`$html = preg_replace('/&(?!amp)/', '&amp;', $html)`), then load using `$domObject->loadHtml($html)`. — msmahon, Dec 06 '22 at 16:24
I think that we're almost there. When I $html->load(), I'm getting an empty object. Or rather, it is an object with all of the pointers, all pointing to empty data. When I try to access the ```a``` hrefs, I get ```object(DOMNodeList)#2 (1) { ["length"]=> int(0) }```. Working on it… — Parapluie, Dec 06 '22 at 16:41
Are you using `load()` or `loadHtml()`? Once you parse them you can loop through the length of the list fetching anchor data like so: `$anchors->item(0)->firstChild->wholeText`. Edit: `load` might be just fine if it returns a DOMNodeList. — msmahon, Dec 06 '22 at 17:07
@msmahon This is on the right track but I want to flesh this out a bit. Hang in there! I will make sure credit goes where credit is due! :-) — Parapluie, Dec 06 '22 at 19:58

msmahon · Accepted Answer · 2022-12-06T18:59:12.190

Fetch the content as a string first, then replace the ampersand instances with something parsable.

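// Fetch the raw markup as a string so it can be cleaned before parsing.
$html = file_get_contents('/path/to/file.html');
// Escape any ampersand that is directly followed by whitespace.
$html = preg_replace('/&(?=\s)/', '&amp;', $html);
$doc = new DOMDocument();
$doc->loadHTML($html);
$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $anchor) {
  // textContent also works when the anchor is empty or wraps child elements.
  print $anchor->textContent . PHP_EOL;
}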
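
If the page mixes valid entities with bare ampersands that are not followed by whitespace (for example "R&D"), a stricter pattern may be worth trying. The following is a sketch only, not a tested fix for the page in question: `escapeBareAmpersands` is a hypothetical helper and the sample markup is invented. It escapes every ampersand that does not already begin a named, decimal, or hexadecimal reference.

```php
<?php
// Hypothetical helper: escape only ampersands that do not already begin
// a named (&name;), decimal (&#38;) or hexadecimal (&#x26;) reference.
function escapeBareAmpersands(string $html): string
{
    $result = preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );

    // preg_replace returns null on failure; fall back to the untouched input.
    return $result ?? $html;
}

// Invented sample mixing valid references with bare ampersands.
$html = '<p>&copy; 2022 <a href="/campus_staff.html">Campus & Staff</a> &#38; R&D</p>';
$clean = escapeBareAmpersands($html);

// Keep any remaining parser messages out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($clean);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $anchor) {
    print $anchor->getAttribute('href') . ' => ' . $anchor->textContent . PHP_EOL;
}
```

Note that this also rewrites ampersands inside script or style blocks, so it suits link extraction better than faithful re-serialisation of the page.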

edited Dec 06 '22 at 18:59

answered Dec 06 '22 at 17:13


`/&(?!amp)/` — That assumes that the document contains **no** HTML entities at all. It will create more problems if that assumption isn't true. – Quentin Dec 06 '22 at 17:22
@Quentin I ignorantly lifted that pattern from a similar issue, so thank you for pointing that out. Perhaps `&(?=\s)` would be more reasonable since it is very unlikely there is any text besides an entity that would have characters directly following an ampersand. – msmahon Dec 06 '22 at 18:58
I actually worked this out another way—due to monstrously poor construction of this page. Some code I built in 2014 seemed to do the trick. However, the logic of msmahon's approach is solid. I'm giving it the thumbs-up. Thanks to all who helped here. – Parapluie Dec 08 '22 at 19:56
