I need to send an XML document to a SOAP web service (which I don't have any control over). I was receiving an error because the text contains HTML entities, so I clean the strings with html_entity_decode() and then htmlspecialchars() before adding the text to the SimpleXML object, like this:

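// Convert to UTF-8 only if the string is not already valid UTF-8.
if( !mb_check_encoding($string, 'UTF-8') ) {
   $string = utf8_encode($string);
}
// Decode all entities, then re-escape only the XML special characters.
$string = htmlspecialchars( html_entity_decode($string, ENT_COMPAT, 'UTF-8'), ENT_COMPAT, 'UTF-8');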
$xml->addChild('PROD_DESC', $string);

Although this cleans named entities such as &copy;, it doesn't do anything with hexadecimal entities like &#xE1;, and the service I am talking to doesn't accept those either.

In this post I found a possible solution, but when I pass that string to the tidy cleanString function I get the same string back; it doesn't touch those entities either.

AJJ
  • possible duplicate of [php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it.](http://stackoverflow.com/questions/3575109/php-using-domdocument-whenever-i-try-to-write-utf-8-it-writes-the-hexadecimal-no) – Gordon Jan 20 '11 at 15:59
  • Yes, sorry, I hadn't seen that one. You gave a good explanation there. – AJJ Jan 20 '11 at 16:19

2 Answers

The numeric entities are added by SimpleXML because your XML document has no declared encoding:

// with declared encoding :
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?><x></x>');
$xml->addChild('PROD_DESC', "à");
// result: <PROD_DESC>à</PROD_DESC>

// without declared encoding :
$xml = simplexml_load_string('<?xml version="1.0"?><x></x>');
$xml->addChild('PROD_DESC', "à");
// result: <PROD_DESC>&#xE0;</PROD_DESC>
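
Putting this together with the cleaning step from the question, a minimal sketch might look like the following. The helper name addCleanChild is hypothetical (it is not part of SimpleXML or of the original post), and the sample text is made up for illustration:

```php
<?php
// Hypothetical helper: decode entities, then add the text as a child element.
function addCleanChild(SimpleXMLElement $parent, string $name, string $text): SimpleXMLElement
{
    // Turn named and numeric entities (&copy;, &#xE9;) into real UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
    // Re-escape only the XML special characters; addChild() does not escape "&" itself.
    $escaped = htmlspecialchars($decoded, ENT_COMPAT, 'UTF-8');
    return $parent->addChild($name, $escaped);
}

// The declared encoding lets the serializer write non-ASCII characters literally.
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?><x></x>');
addCleanChild($xml, 'PROD_DESC', 'Caf&#xE9; &copy; 2011');
echo $xml->asXML();
```

With the encoding declared, the serialized PROD_DESC element should carry the literal characters rather than numeric references.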
Álvaro González
Arnaud Le Blanc
  • This works! Thanks. Just one piece left: all the entities are gone except the carriage return, &#13;, at the end of every line in those text fields. Why are these entities being inserted? – AJJ Jan 20 '11 at 16:16
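
On the carriage-return question in the comment above: libxml, which SimpleXML uses to serialize the document, writes a literal CR character in text content as a numeric reference so that it survives a later parse. One possible workaround, assuming the service does not need Windows-style line endings, is to normalize line endings before adding the text. The sample string below is made up for illustration:

```php
<?php
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?><x></x>');

// Windows-style line endings (assumed input for this sketch).
$text = "line one\r\nline two\r\n";

// Collapse CRLF and lone CR to LF so no carriage return reaches the serializer.
$normalized = str_replace(array("\r\n", "\r"), "\n", $text);

$xml->addChild('PROD_DESC', $normalized);
echo $xml->asXML();
```

If the receiving service does require CRLF line endings, this normalization would not be appropriate.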
Is it acceptable for you to pass the string as base64 encoded data? This would eliminate the need to strip anything out.
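
A rough sketch of that idea, assuming the service would accept and decode base64 in that field (the question does not confirm this, and the sample text is made up):

```php
<?php
$xml = simplexml_load_string('<?xml version="1.0" encoding="utf-8"?><x></x>');

// Text with entities, markup and a CRLF, none of which needs cleaning first.
$text = "Café &copy; <b>2011</b>\r\n";

// The base64 alphabet contains no XML special characters, so nothing gets escaped.
$xml->addChild('PROD_DESC', base64_encode($text));
echo $xml->asXML();

// What the receiving side would have to do (assumed) to recover the text.
$roundTrip = base64_decode((string) $xml->PROD_DESC);
```

The trade-off is that the field is no longer human-readable in the XML and the receiver has to cooperate.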

horatio