I'm using the code below to extract the content I want from HTML with DOMDocument:

$subject = 'some html code';
$doc = new DOMDocument('1.0');                   
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
    $domNode = $docSave->importNode($node, true);
    $docSave->appendChild($domNode);
}
echo $docSave->saveHTML();

The problem is that if the HTML in $subject contains a special character such as a space or a newline, it gets converted to an HTML entity. The input HTML is far from well formed, and some special characters also appear inside paths in tags, for instance:

$subject = "<div><a href='http://www.site.com/test.php?a=1&b=2, 3, 
4'></a></div>";

will produce:

<div><a href='http://www.site.com/test.php?a=1&b=2,%203,%0A%204'></a></div>

instead of:

<div><a href='http://www.site.com/test.php?a=1&b=2, 3, 
    4'></a></div>

What can one do to prevent special characters from being converted to their entities, if the goal is to keep the invalid HTML as it is?

I tried setting the substituteEntities flag to false but saw no improvement. Maybe I used it wrong? Some code examples would be very helpful.

Ben Swinburne
Jimmix
  • I think they are perfectly fine. Both urls are valid and same. – Shiplu Mokaddim Feb 04 '12 at 17:05
  • These aren't HTML entities. They are URL-specific escapes. And at least the PHP frontend for libxml [does not provide any option](http://php.net/manual/en/libxml.constants.php) to influence this normalization. – mario Feb 04 '12 at 17:08
  • [Spaces and line breaks are actually invalid in URLs.](http://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940) It's only thanks to the tolerance of browsers (or DOMDocument) that they are handled gracefully and encoded properly. – Gumbo Feb 04 '12 at 17:09
  • Thank you for correcting me. True, these are not entities. The aim of the extraction is to keep the source code unchanged even if it is wrong. I strongly agree that URLs should not contain characters like newlines or spaces, but I have to keep the original form. – Jimmix Feb 04 '12 at 17:17

1 Answer

You can't use a parser and still keep the bad HTML intact. A parser has to clean up the HTML in order to parse it.

If you absolutely must use the bad HTML, use regexes but be aware that there is an extreme risk of head injury as you will either be -brick'd- or bang your head against the desk too much.
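For what it's worth, here is a minimal sketch of the regex route, using a recursive PCRE pattern so that nested divs stay balanced. The helper name extractDivsRaw is made up for this example. It assumes the opening div tag has no '>' inside its attribute values and that no "<div" text hides in comments, scripts or attribute values. Unlike the //div query in the question, it returns only top-level divs, and real-world markup can easily break it.

```php
<?php
// Hypothetical helper (not from the original post): returns each top-level
// <div>...</div> block exactly as it appears in the source, without any
// parser normalization of spaces or newlines in attribute values.
function extractDivsRaw(string $html): array
{
    // [^<]++          consumes plain text and tag bodies up to the next '<'
    // <(?!/?div\b)    consumes a '<' that does not start a div open/close tag
    // (?R)            recurses into the whole pattern for a nested <div>
    // Assumption: no '>' inside the opening <div> tag's attribute values.
    $pattern = '~<div\b[^>]*>(?:[^<]++|<(?!/?div\b)|(?R))*+</div>~i';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

// Sample input modeled on the question: a raw space and newline inside href.
$subject = "<p>skip</p><div><a href='http://www.site.com/test.php?a=1&b=2, 3, \n4'></a></div>";
echo implode("\n", extractDivsRaw($subject));
```

The matched text is copied straight out of the input string, so the space and the newline inside the href stay untouched. The price is that nothing validates the markup, so an unclosed div simply produces no match for that block.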

Niet the Dark Absol
  • 320,036
  • 81
  • 464
  • 592