Using PHP 5.3.13
simplexml_load_string fails with the error "Entity 'divide' not defined" when parsing XML.
Most solutions to this issue focus on how SimpleXMLElement and its addChild and addAttribute methods behave. Those methods convert some special characters into entities, and the usual advice is to handle the few special characters that simplexml_load_string does not understand.
The problem is that the list is very large. If you run htmlentities($string, ENT_QUOTES, 'UTF-8', true) on the $string you are about to insert with addChild, the insert succeeds, but simplexml_load_string then fails when it parses the XML that asXML produced from the SimpleXMLElement.
Another issue is that, however long the list of generated entities may be, users can just as easily type something like &pizza; and break the parser. Since I need to handle all user input, I came up with the following, and I want to know if you see any cases where it will fail.
I want to know whether the following solution works: replace every & in the string with &amp;. I have not found a case where this breaks, but it is so simple that I am suspicious, and I have not seen it listed as a solution in related questions such as:
- Rationale behind SimpleXMLElement's handling of text values in addChild and addAttribute (covers this issue but does not solve the general case)
- XML parser error: entity not defined (addresses just a few special characters)
Here is some sample code for my possible solution:
$content_amp_safe = str_replace('&', '&amp;', $content);
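To make the idea concrete, here is a minimal sketch of a round trip through addChild, asXML, and simplexml_load_string with every ampersand escaped first. The helper name escape_ampersands and the sample text are hypothetical, chosen only for illustration; this is one way to check the approach, not a verified general solution.

```php
<?php
// Hypothetical helper: escape every ampersand so that no entity
// reference (known or unknown) survives into the generated XML.
function escape_ampersands($text)
{
    return str_replace('&', '&amp;', $text);
}

// Sample input containing two entity names that XML does not define.
$content = "I love &pizza; and 5 &divide; 2";

$xml = new SimpleXMLElement("<?xml version='1.0' encoding='utf-8'?><Entries />");
$entry = $xml->addChild('Entry');
$entry->addChild('Content', escape_ampersands($content));

// Parse the generated XML again and read the text back.
$parsed = simplexml_load_string($xml->asXML());
$roundTripped = (string) $parsed->Entry->Content;

// True when the parsed text is identical to the original input.
var_dump($roundTripped === $content);
```

Note that the text read back is the literal input, so a string that went through htmlentities first comes back still containing names such as &divide; as plain text. Whoever consumes the XML would need to decode those if the actual characters are wanted.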
Here are the error messages:
Warning: simplexml_load_string(): Entity: line 11: parser error : internal error Entity 'divide' not defined
Here is code that would cause the problem pre-solution:
<?php
// Input that triggers the issue: a Windows-encoded dash, ellipsis, and right quote as examples.
// An unknown entity such as &pizza; typed by the user breaks the parser as well.
$content = "I love &pizza; in the … morning's – night as well";
$content_unsafe = htmlentities($content, ENT_QUOTES, 'UTF-8', true);
// The fix is to pass $content_amp_safe to addChild instead of $content_unsafe.
$content_amp_safe = str_replace('&', '&amp;', $content_unsafe);
$xml = new SimpleXMLElement("<?xml version='1.0' encoding='utf-8'?><Entries />");
$entry = $xml->addChild('Entry');
// Pre-solution: inserting the htmlentities output directly reproduces the error.
$entry->addChild('Content', $content_unsafe);
$xml_string = $xml->asXML();
// Collect libxml errors instead of emitting warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($xml_string);
if ($xml === false) {
    $error_string = "Failed loading XML\n";
    foreach (libxml_get_errors() as $error) {
        $error_string .= "\t" . $error->message;
    }
    echo $error_string;
}
libxml_use_internal_errors(false);
?>
Here is a short list of some of the characters that cause issues when htmlentities is applied to user input:
<?php
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'cp1252');
var_dump($table);
?>
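As a rough illustration of why that list is so long: XML itself predefines only five named entities (amp, lt, gt, quot, apos), so every other name that htmlentities can emit is unknown to the parser unless a DTD declares it. The sketch below filters the translation table down to those names. It uses 'UTF-8' rather than 'cp1252', which is an assumption made so the example does not depend on the source file's encoding, and the variable names are hypothetical.

```php
<?php
// The five named entities that XML predefines; anything else needs a DTD.
$xmlPredefined = array('&amp;', '&lt;', '&gt;', '&quot;', '&apos;');

// Assumption: UTF-8 instead of the cp1252 table dumped above.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');

$unknownToXml = array();
foreach ($table as $char => $entity) {
    // Numeric references such as &#039; are always valid in XML.
    $isNumeric = strpos($entity, '&#') === 0;
    if (!$isNumeric && !in_array($entity, $xmlPredefined, true)) {
        $unknownToXml[] = $entity;
    }
}

echo count($unknownToXml) . " named entities are not predefined in XML, e.g. ";
echo implode(' ', array_slice($unknownToXml, 0, 5)) . "\n";
```

Any entity in $unknownToXml, plus anything a user invents such as &pizza;, would trigger the "not defined" error if it reached the parser unescaped.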
Example Characters:
€‚ƒ„…†‡ˆ‰Š‹Œ‘’“”•–—˜™š›œŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ&"'<>
Example Encoding:
€‚ƒ„…†‡ˆ‰Š‹Œ‘’“”•–—˜™š›œŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ&"'<>
Example New Encoding:
€‚ƒ„…†‡ˆ‰Š‹Œ‘’“”•–—˜™š›œŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ&"'<>