PHP's DOMDocument class mangles UTF-8 input unless you prepare the input first.

For example, this code:

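<?php
// Show the internal encoding to confirm that it is UTF-8.
echo mb_internal_encoding()."\n\n";

// A single right single quotation mark (U+2019), saved as UTF-8.
$str = '’';
$dom = new DOMDocument;
// No encoding is declared for the input here.
$dom->loadHTML($str);
echo $dom->saveHTML();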

produces this output:

UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>&acirc;&#128;&#153;</p></body></html>

&acirc;&#128;&#153; should be &rsquo;.
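
The three entities line up with the three bytes of the UTF-8 encoding of ’, which suggests each byte is being read as its own single-byte character. A small sketch to check the byte values (the variable names are only for illustration):

```php
<?php
// "\xE2\x80\x99" is the UTF-8 encoding of U+2019 (right single quotation mark).
$bytes = "\xE2\x80\x99";

// Split the string into single bytes and take the numeric value of each.
$codes = array_map('ord', str_split($bytes));

foreach ($codes as $code) {
    // Print each byte in hex next to the numeric entity with the same value.
    printf("0x%02X -> &#%d;\n", $code, $code);
}
```

This prints 0xE2 -> &#226; (the code point that &acirc; names), then 0x80 -> &#128; and 0x99 -> &#153;, matching the output above.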

I want to know all the character entities, like &acirc;, that DOMDocument may produce if you don't use the fix. Is there a list somewhere? Is it in the PHP source code, or in the libxml source code?
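
For context, the fix is usually some way of telling the parser that the input is UTF-8 before it is loaded. A minimal sketch of one commonly cited workaround, prepending an XML encoding declaration; the helper name loadUtf8Html is hypothetical, and whether this works may depend on your PHP and libxml versions:

```php
<?php
// Hypothetical helper (not from the original post): parse an HTML fragment as UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $dom = new DOMDocument;
    // Assumption: a leading XML declaration makes the parser read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    return $dom;
}

$dom = loadUtf8Html('’');
// Print the parsed text rather than the serialized markup.
echo $dom->documentElement->textContent, "\n";
```

Checking textContent instead of saveHTML() keeps the check independent of how a given version chooses to serialize non-ASCII characters.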

David Winiecki

1 Answer

I thought of a way to find out without reading any references or source code:

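<?php

// Build a string containing every byte value from 1 to 255, one per line.
$str = '';
for ($i = 1; $i < 256; $i++) {
    $str .= chr($i)."\n";
}

// Append the NUL byte last.
$str .= chr(0)."\n";

// Load the bytes without declaring an encoding, then print the serialized HTML.
$dom = new DOMDocument;
$dom->loadHTML($str);
echo $dom->saveHTML();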

If you need an accurate list, I recommend running this on your own system, since the output may differ between versions of PHP or libxml.

Expect a lot of warning messages, but no errors.

Here's the output I get, after using a text editor to remove everything that isn't a character entity:

&amp;
&#128;
&#129;
&#130;
&#131;
&#132;
&#133;
&#134;
&#135;
&#136;
&#137;
&#138;
&#139;
&#140;
&#141;
&#142;
&#143;
&#144;
&#145;
&#146;
&#147;
&#148;
&#149;
&#150;
&#151;
&#152;
&#153;
&#154;
&#155;
&#156;
&#157;
&#158;
&#159;
&nbsp;
&iexcl;
&cent;
&pound;
&curren;
&yen;
&brvbar;
&sect;
&uml;
&copy;
&ordf;
&laquo;
&not;
&shy;
&reg;
&macr;
&deg;
&plusmn;
&sup2;
&sup3;
&acute;
&micro;
&para;
&middot;
&cedil;
&sup1;
&ordm;
&raquo;
&frac14;
&frac12;
&frac34;
&iquest;
&Agrave;
&Aacute;
&Acirc;
&Atilde;
&Auml;
&Aring;
&AElig;
&Ccedil;
&Egrave;
&Eacute;
&Ecirc;
&Euml;
&Igrave;
&Iacute;
&Icirc;
&Iuml;
&ETH;
&Ntilde;
&Ograve;
&Oacute;
&Ocirc;
&Otilde;
&Ouml;
&times;
&Oslash;
&Ugrave;
&Uacute;
&Ucirc;
&Uuml;
&Yacute;
&THORN;
&szlig;
&agrave;
&aacute;
&acirc;
&atilde;
&auml;
&aring;
&aelig;
&ccedil;
&egrave;
&eacute;
&ecirc;
&euml;
&igrave;
&iacute;
&icirc;
&iuml;
&eth;
&ntilde;
&ograve;
&oacute;
&ocirc;
&otilde;
&ouml;
&divide;
&oslash;
&ugrave;
&uacute;
&ucirc;
&uuml;
&yacute;
&thorn;
&yuml;
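
The text-editor cleanup could also be scripted. A minimal sketch, assuming a simple regular expression is enough to pick out named and numeric entities; the helper name extractEntities is hypothetical, and the NUL byte is left out of the probe string here:

```php
<?php
// Hypothetical helper: return the distinct character entities found in serialized HTML.
function extractEntities(string $html): array
{
    // Matches decimal (&#128;), hexadecimal (&#x80;) and named (&acirc;) entities.
    preg_match_all('/&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/', $html, $m);
    // Drop duplicates but keep the order of first appearance.
    return array_values(array_unique($m[0]));
}

// Same probe input as above: every byte value from 1 to 255, one per line.
$str = '';
for ($i = 1; $i < 256; $i++) {
    $str .= chr($i)."\n";
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($str);
libxml_clear_errors();

echo implode("\n", extractEntities($dom->saveHTML())), "\n";
```

The list it prints may not match the one above exactly, for the same reason as before: the result depends on the PHP and libxml versions in use.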
David Winiecki