I am trying to create an XML object from a string with SimpleXML.

Example (http://ideone.com/L4ztum):

 $str = "<aoc> САМОЛЕТОМ ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";

$movies = new SimpleXMLElement($str);

But it gives a warning:

PHP Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : PCDATA invalid Char value 2 in /home/nmo2E7/prog.php on line 5

and finally throws an Exception with the message "String could not be parsed as XML".

If I remove two Unicode characters, it works (http://ideone.com/LaMvHN):

$str = "<aoc> САМОЛЕТОМ ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";
                          ^
                           `-- two invisible characters have been removed here

How can I remove Unicode from string?

hakre
user1142806
See: http://stackoverflow.com/questions/1176904/php-how-to-remove-all-non-printable-characters-in-a-string – CD001 Sep 07 '15 at 15:22

2 Answers

The problem is not Unicode itself but two stray control bytes, \x01 and \x02. You can filter them out with str_replace:

$s = str_replace("\x01", "", $s);
$s = str_replace("\x02", "", $s);
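
The two calls can also be collapsed into one, because str_replace accepts an array of search strings. A minimal sketch, assuming a shortened version of the sample string; the variable names $dirty and $clean are illustrative only:

```php
<?php
// Shortened sample; "\x02\x01" are the two stray control bytes.
$dirty = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант</aoc>";

// str_replace accepts an array of search strings and removes each of them.
$clean = str_replace(["\x01", "\x02"], "", $dirty);

// With the control bytes gone, the string parses as XML.
$xml = new SimpleXMLElement($clean);
echo (string) $xml, "\n";
```

This only covers the two bytes seen here; other control characters would still break the parser.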
Bart Friederichs

The SimpleXMLElement constructor requires its first parameter to be well-formed XML.

The string you pass

$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";

is not well-formed XML because it contains characters outside the range of characters that XML allows, namely:

  • Unicode Character 'START OF TEXT' (U+0002) at binary offset 24
  • Unicode Character 'START OF HEADING' (U+0001) at binary offset 25
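
The offsets listed above can be reproduced with a short script. This is a sketch under the assumption that only C0 control characters are of interest; the names $pattern and $found are illustrative:

```php
<?php
// The string from the question, with the two control characters written as escapes.
$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12</aoc>";

// C0 control characters other than tab (\x09), LF (\x0A) and CR (\x0D).
$pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F]/';

// PREG_OFFSET_CAPTURE reports the byte offset of every match.
preg_match_all($pattern, $str, $matches, PREG_OFFSET_CAPTURE);

$found = [];
foreach ($matches[0] as [$char, $offset]) {
    $found[$offset] = sprintf('U+%04X', ord($char));
    printf("%s at binary offset %d\n", $found[$offset], $offset);
}
```

The offsets are counted in bytes, not characters: each Cyrillic letter before the control characters occupies two bytes in UTF-8.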

So instead of constructing a SimpleXMLElement from a hand-assembled XML string (which is error-prone), use SimpleXMLElement itself to build the XML you want. Here is an example.

The following example assumes you already have the text that should become the element's content. It creates an XML element like the one in your question, except that the same string is passed in as the text content of the document element ("<aoc>").

$text     = 'САМОЛЕТОМ ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12';
$xml      = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><aoc/>');
$xml->{0} = $text; // set the document-element's text-content to $text

When done this way, SimpleXML will filter any invalid control-characters for you and the SimpleXMLElement remains stable:

$str    = $xml->asXML();
$movies = new SimpleXMLElement($str);
print_r($movies);

/* output:

SimpleXMLElement Object
(
    [0] => САМОЛЕТОМ ТК Адамант,  г.Домодедово, мкр-н Востряково, Центральный просп. д.12
)

*/

So to finally answer your question:

How can I remove Unicode from string?

You don't want to remove Unicode from the string: the SimpleXML library only accepts Unicode strings (in the UTF-8 encoding). What you want is to remove the Unicode characters that are invalid in XML. The SimpleXML library does that for you when you set node values the way it was designed to be used.
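
If an existing string has to be cleaned before it is parsed, the characters to drop are those outside the Char production of XML 1.0. A minimal sketch, assuming valid UTF-8 input; stripInvalidXmlChars is a hypothetical helper name, not part of SimpleXML:

```php
<?php
/**
 * Remove every code point that the XML 1.0 "Char" production does not allow.
 * Hypothetical helper for illustration; expects valid UTF-8 input.
 */
function stripInvalidXmlChars(string $text): string
{
    // Allowed: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );
    if ($clean === null) {
        // preg_replace returns null on error, e.g. when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$text = "САМОЛЕТОМ\x02\x01 ТК Адамант";
echo stripInvalidXmlChars($text), "\n";
```

Tab, line feed and carriage return are kept because XML allows them; every other C0 control character is removed.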

However, if you try to load XML that is not well-formed via the constructor or the loader functions (simplexml_load_string etc.), it will fail and give you the (important) error.
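
To inspect that error in code rather than as a PHP warning, libxml's internal error handling can be switched on. A sketch, assuming a shortened version of the sample string; the variable names $errors and $caught are illustrative:

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$str = "<aoc> САМОЛЕТОМ\x02\x01 ТК Адамант</aoc>";

// simplexml_load_string() returns false when the input is not well-formed.
$xml    = simplexml_load_string($str);
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The constructor signals the same failure with an exception.
$caught = null;
try {
    new SimpleXMLElement($str);
} catch (Exception $e) {
    $caught = $e->getMessage();
}
libxml_clear_errors();
```

Checking the return value against false with === matters here, because a successfully parsed but empty element also evaluates as falsy.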

I hope this clarifies the situation for you and answers your question.

hakre