
Using PHP 5.3.13

simplexml_load_string throws the Entity 'divide' not defined error when parsing xml.

Most solutions to this issue focus on how SimpleXMLElement and its addChild and addAttribute methods behave. Those methods convert only some special characters into entities. The usual advice is then to handle, one by one, the few special characters that simplexml_load_string does not understand.

The problem is that the list is very large. If you run htmlentities($string, ENT_QUOTES, 'UTF-8', true) on the $string you are about to insert with addChild, the insert succeeds, but simplexml_load_string then fails when it parses the XML that SimpleXMLElement generated with asXML().

Another issue: the list of generated entities may be long, but users can just as easily type something like &pizza; and that breaks the parser too. Since I need to handle all user input, I came up with the following, and I want to know whether you see any cases where it fails.

I want to know whether the following solution works: replace every & in the string with &amp;. I have been unable to find a case where my solution breaks, but it is so simple, and I have not seen it listed as a solution in these related questions:

  1. Rationale behind SimpleXMLElement's handling of text values in addChild and addAttribute - On this issue but does not solve the general issue
  2. XML parser error: entity not defined - Addressing just a few special characters

Here is some sample code for my possible solution:

$content_amp_safe = str_replace('&','&amp;',$content);

Here are the error messages:

Warning: simplexml_load_string(): Entity: line 11: parser error : internal error Entity 'divide' not defined

Here is code that would cause the problem pre-solution:

<?php
// insert that causes the issue with the windows encoded dash, triple dot, and right quote as an example
// also issue if user enters &pizza; in the text as it is an unknown entity
$content = "I love &pizza; in the … morning's  – night as well";
$content_unsafe = htmlentities($content, ENT_QUOTES, 'UTF-8', true);
// the fix: escape every ampersand and pass $content_amp_safe to addChild() instead
$content_amp_safe = str_replace('&','&amp;',$content_unsafe);
$xml = new SimpleXMLElement("<?xml version='1.0' encoding='utf-8'?><Entries />");
$entry = $xml->addChild('Entry');
// passing $content_unsafe reproduces the error; $content_amp_safe parses cleanly
$entry->addChild('Content', $content_unsafe);
$xml_string = $xml->asXML();
libxml_use_internal_errors(true);
$xml = simplexml_load_string($xml_string);
if ($xml === false) {
    $error_string = "Failed loading XML\n";
    foreach ( libxml_get_errors() as $error ) {
        $error_string .= "\t" . $error->message;
    }
    echo $error_string;
}
libxml_use_internal_errors(false);
?>

Here is a short list of some of the characters that cause issues when htmlentities is applied to user input:

<?php 
 $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'cp1252');
 var_dump($table);
?>

Example Characters:

€‚ƒ„…†‡ˆ‰Š‹Œ‘’“”•–—˜™š›œŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ&"'<>

Example Encoding:

€‚ƒ„…†‡ˆ‰Š‹Œ‘’“”•–—˜™š›œŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ&"'<>

Example New Encoding:

&euro;&sbquo;&fnof;&bdquo;&hellip;&dagger;&Dagger;&circ;&permil;&Scaron;&lsaquo;&OElig;&lsquo;&rsquo;&ldquo;&rdquo;&bull;&ndash;&mdash;&tilde;&trade;&scaron;&rsaquo;&oelig;&Yuml;&nbsp;&iexcl;&cent;&pound;&curren;&yen;&brvbar;&sect;&uml;&copy;&ordf;&laquo;&not;&shy;&reg;&macr;&deg;&plusmn;&sup2;&sup3;&acute;&micro;&para;&middot;&cedil;&sup1;&ordm;&raquo;&frac14;&frac12;&frac34;&iquest;&Agrave;&Aacute;&Acirc;&Atilde;&Auml;&Aring;&AElig;&Ccedil;&Egrave;&Eacute;&Ecirc;&Euml;&Igrave;&Iacute;&Icirc;&Iuml;&ETH;&Ntilde;&Ograve;&Oacute;&Ocirc;&Otilde;&Ouml;&times;&Oslash;&Ugrave;&Uacute;&Ucirc;&Uuml;&Yacute;&THORN;&szlig;&agrave;&aacute;&acirc;&atilde;&auml;&aring;&aelig;&ccedil;&egrave;&eacute;&ecirc;&euml;&igrave;&iacute;&icirc;&iuml;&eth;&ntilde;&ograve;&oacute;&ocirc;&otilde;&ouml;&divide;&oslash;&ugrave;&uacute;&ucirc;&uuml;&yacute;&thorn;&yuml;&amp;&quot;&#039;&lt;&gt;

Serg
techphd
  • The two Q&A entries you refer to in your question are pretty good ones on the issue at hand. I wonder a bit, you seem to be a bit unsure about your conclusion. I don't know why. Encoding "`&`" as "`&amp;`" is exactly what is necessary when creating XML. This is also what **SimpleXMLElement** does (when you use property access to change the text content of an element node). – hakre Mar 06 '15 at 23:28
  • Thanks @hakre. I felt pretty good about my solution. I was more surprised that no one else had run into this issue, as it should be pretty common. I wanted to make sure I was not re-inventing the wheel or using the functions incorrectly. – techphd Mar 08 '15 at 00:35
  • From your answer below, it appears that doing the assignment **$entry->addChild('Content')->{0} = $content** will result in all & characters being converted to &amp; without the need for a pre-processing function. So, you are saying **$entry->addChild('Content', $content);** should only be used with text that is known not to contain any entities (aka &pizza;). I might switch to your syntax to keep it simpler, thanks again. – techphd Mar 08 '15 at 00:44
  • Yes regarding the answer, but no regarding `addChild()`: you use `addChild()` if you want to insert those entities verbatim. `&pizza;` can be valid, and that method allows you to insert it that way. – hakre Mar 08 '15 at 18:58

1 Answer

Your observation is correct that SimpleXMLElement::addChild() (and ::addAttribute()) convert (only) some special characters into entities.

This is to enter some characters there verbatim (especially the ampersand "&" character).
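A minimal sketch of that behaviour, assuming a stock libxml-backed SimpleXML; the sample string is made up for illustration. The "<" is a plain special character, while "&amp;" is already an entity reference:

```php
<?php
// Sketch: how addChild() treats "<" versus an existing entity reference.
$xml = new SimpleXMLElement('<Entries />');

// The value is handed over as-is: "<" is raw text, "&amp;" is already escaped.
$xml->addChild('Content', 'x < y &amp; z');

$out = $xml->asXML();
echo $out;
// "<" is escaped on output, while "&amp;" is kept as written
// rather than being escaped a second time.
```

The flip side is that any other entity reference in the value, such as &divide;, is also kept as written, and that is what a later parse without a matching DTD rejects.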

However, you don't want it that way in your case. To convert all special characters, you need to set the text value of an XML element via property access, for example:

$entry->Content = $content;

As you can see, $entry->addChild('Content', $content) isn't used; the property access $entry->Content is used instead. That property access only works if you insert a single Content element. If you want to insert more than one into the same parent, you have to use a so-called SimpleXML self-reference. A demonstration, now with addChild() again:

$entry->addChild('Content')->{0} = $content;

An example in full:

$content = "I love &pizza; in the … morning's  – night as well";

$xml = new SimpleXMLElement("<Entries />");
$entry = $xml->addChild('Entry');
$entry->Content = $content;
$entry->addChild('Content')->{0} = $content;

echo $xml->asXML();

Output (beautified):

<?xml version="1.0"?>
<Entries>
  <Entry>
    <Content>I love &amp;pizza; in the … morning's  – night as well</Content>
    <Content>I love &amp;pizza; in the … morning's  – night as well</Content>
  </Entry>
</Entries>
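To check that approach against arbitrary input, a round trip can be sketched as follows. The sample string is an assumption chosen to include an unknown entity name, a bare ampersand, and markup characters; it is not taken from the question:

```php
<?php
// User input with an unknown entity name, a literal ampersand, and tags.
$content = 'I love &pizza; & fries <b>daily</b>';

$xml = new SimpleXMLElement('<Entries />');
$entry = $xml->addChild('Entry');

// Property access escapes "&", "<" and ">" in the assigned text.
$entry->Content = $content;

// Parse the generated XML again, collecting errors instead of emitting warnings.
libxml_use_internal_errors(true);
$parsed = simplexml_load_string($xml->asXML());
libxml_use_internal_errors(false);

// Both checks are expected to hold: the XML parses, and the text survives unchanged.
var_dump($parsed !== false);
var_dump((string) $parsed->Entry->Content === $content);
```

Because the escaping happens on assignment, no separate str_replace() or htmlentities() pass should be needed before inserting the text.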

I hope it's not too confusing for the moment.

Next to the problem you have with the ampersand, you might see some character encoding issues. For those there is one simple rule: Whenever you pass a string to SimpleXMLElement, the encoding of that string must be UTF-8.

So if you get data from an HTML form on your website, make sure the browser sends that data UTF-8 encoded, or re-encode the data into UTF-8 before passing it to the SimpleXMLElement.
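One way to apply that rule is sketched below. It assumes the mbstring extension is loaded and, as a loud assumption, that any input which is not valid UTF-8 is Windows-1252; to_utf8() is a hypothetical helper name, not part of SimpleXML:

```php
<?php
// Hypothetical helper (not part of SimpleXML): returns $input as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8($input)
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// "é" and an en dash as single Windows-1252 bytes.
$raw = "caf\xE9 \x96 bar";

$xml = new SimpleXMLElement('<Entries />');
$xml->Content = to_utf8($raw); // SimpleXMLElement only ever sees UTF-8
echo $xml->asXML();
```

If the real source encoding is known (for example from the form's accept-charset or a database column), passing that name instead of guessing Windows-1252 is the safer choice.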

hakre