71

I'm getting the error:

parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xED 0x6E 0x2C 0x20

This happens when I try to process an XML response from a 3rd party source using simplexml_load_string. The raw XML response does declare the encoding:

<?xml version="1.0" encoding="UTF-8"?>

Yet it seems that the XML is not really UTF-8. The language of the XML content is Spanish, and it contains words like Dublín.

I'm unable to get the 3rd party to sort out their XML.

How can I pre-process the XML and fix the encoding incompatibilities?

Is there a way to detect the correct encoding for an XML file?

Camsoft

11 Answers

80

Your 0xED 0x6E 0x2C 0x20 bytes correspond to "ín, " in ISO-8859-1, so it looks like your content is in ISO-8859-1, not UTF-8. Tell your data provider about it and ask them to fix it, because if it doesn't work for you it probably doesn't work for other people either.
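To check this kind of guess yourself, here is a minimal sketch. It assumes the mbstring extension is available, and the variable names are only illustrative. It takes the four bytes from the error message, confirms that they are not valid UTF-8, and then decodes them as ISO-8859-1:

```php
<?php
// The four bytes reported in the parser error.
$bytes = "\xED\x6E\x2C\x20";

// 0xED opens a 3-byte UTF-8 sequence, but 0x6E is not a continuation
// byte, so the string is not valid UTF-8.
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Interpret the same bytes as ISO-8859-1 and convert them to UTF-8.
$decoded = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);    // bool(false)
echo $decoded, "\n";  // the text "ín, " (with a trailing space)
```

If the decoded text reads as plausible words in the expected language, ISO-8859-1 (or the closely related cp1252) is a reasonable guess.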

There are a few ways to work around it, which you should only use if you cannot load the XML normally. One is to use utf8_encode(). The downside is that if the XML contains both valid UTF-8 and some ISO-8859-1, the result will contain mojibake. Alternatively, you can try to convert the string from UTF-8 to UTF-8 using iconv() or mbstring and hope they'll fix it for you. (They won't, but you can at least ignore the invalid characters so that the XML loads.)
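As a rough sketch of the first workaround: the helper below converts only when the input is not already valid UTF-8, which avoids double-encoding a feed that is actually fine. The function name `to_utf8_assuming_latin1` and the sample document are made up for illustration, and the sketch assumes the mbstring and SimpleXML extensions are loaded. It uses mb_convert_encoding() rather than utf8_encode() because the latter is deprecated as of PHP 8.2. It does not help with documents that mix both encodings.

```php
<?php
/**
 * Hypothetical helper: return $xml as UTF-8, assuming that anything
 * which is not already valid UTF-8 is ISO-8859-1.
 */
function to_utf8_assuming_latin1(string $xml): string
{
    if (mb_check_encoding($xml, 'UTF-8')) {
        return $xml; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($xml, 'UTF-8', 'ISO-8859-1');
}

// Sample input: declares UTF-8 but stores í as the single byte 0xED.
$raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><city>Dubl\xEDn</city>";

$xml = simplexml_load_string(to_utf8_assuming_latin1($raw));
echo (string) $xml, "\n";
```

The check-then-convert order is a design choice: converting unconditionally would turn correctly encoded characters into mojibake.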

Or you can take the long, long road and validate/fix the sequences by yourself. That will take you a while depending on how familiar you are with UTF-8. Perhaps there are libraries out there that would do that, although I don't know any.

Either way, notify your data provider that they're sending invalid data so that they can fix it.


Here's a partial fix. It will definitely not fix everything, but it will fix some of it, hopefully enough for you to get by until your provider fixes their stuff.

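// Re-encode, as if it were ISO-8859-1, every byte in 0xA1-0xFF that is not
// followed by at least two UTF-8 continuation bytes (0x80-0xBF).
function fix_latin1_mangled_with_utf8_maybe_hopefully_most_of_the_time($str)
{
    return preg_replace_callback('#[\\xA1-\\xFF](?![\\x80-\\xBF]{2,})#', 'utf8_encode_callback', $str);
}

// Callback: convert the single matched byte from ISO-8859-1 to UTF-8.
function utf8_encode_callback($m)
{
    return utf8_encode($m[0]);
}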
Josh Davis
  • This is very helpful. I was able to fix the XML by using utf8_encode(). Can you tell me how you deciphered the encoding from the string `0xED 0x6E 0x2C 0x20`? – Camsoft Mar 25 '10 at 10:10
  • ISO-8859-1 is widely used in the Western world. If it's not UTF-8, it's usually ISO-8859-1 (or cp1252). As for the value of each byte, I just looked it up in the character table. – Josh Davis Mar 26 '10 at 06:39
  • To decode the bytes, go to https://www.dcode.fr/ascii-code. Under "ASCII Converter", enter the hex values (e.g. ed 6e 2c 20); the result is shown on the left-hand side under "Results", in the HEX /2 column. – Ivan Chau Jul 09 '22 at 17:40
55

I solved this using

$content = utf8_encode(file_get_contents('http://example.com/rss.xml'));
$xml = simplexml_load_string($content);
Erik
  • Worked for me too, in my case the XML didn't declare an encoding and came from one of those "Enterprise" systems so had weird encoding anyway – Erin Drummond Feb 25 '13 at 02:20
  • I had the same issue when using DOMDocument->load(), this solution works fine, just have to use ->loadXML on the result of file_get_contents – Chaoley Dec 13 '14 at 06:07
  • Works for me too! I was receiving files with ANSI characters in an XML file with a UTF-8 encoding. – Cagy79 Jul 08 '15 at 11:52
  • Why is there a $ before content? I get an error because of it. – Mostafa90 Feb 09 '16 at 10:08
19

If you are sure that your XML is encoded in UTF-8 but contains bad characters, you can use this call to drop them:

$content = iconv('UTF-8', 'UTF-8//IGNORE', $content);
Antikhippe
befox
7

We recently ran into a similar issue and were unable to find any obvious cause. It turned out that there was a control character in our string, but when we output that string to the browser the character was not visible unless we copied the text into an IDE.

We managed to solve our problem thanks to this post and this:

preg_replace('/[\x00-\x1F\x7F]/', '', $input);
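Note that the pattern above also removes tab, line feed and carriage return, which XML 1.0 allows. If the formatting of the text matters, a slightly narrower variant can keep them. This is only a sketch; the function name `strip_xml_control_chars` is made up for illustration.

```php
<?php
// Hypothetical helper: strip C0 control characters and DEL (0x7F),
// but keep tab (0x09), line feed (0x0A) and carriage return (0x0D),
// which XML 1.0 permits.
function strip_xml_control_chars(string $input): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

// Sample input containing a backspace (0x08) and a NUL byte (0x00).
$dirty = "<a>line1\nline2\x08\x00</a>";
echo strip_xml_control_chars($dirty), "\n";
```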

Community
Paul Blundell
3

Instead of using JavaScript, you can simply put this line of code after your mysql_connect statement:

mysql_set_charset('utf8',$connection);

Cheers.

Chango
2

Can you open the 3rd party XML source in Firefox and see what it auto-detects as encoding? Maybe they are using plain old ISO-8859-1, UTF-16 or something else.

If they declare it to be UTF-8, though, and serve something else, their feed is clearly broken. Working around such a broken feed feels horrible to me (even though sometimes unavoidable, I know).

If it's a simple case like "UTF-8 versus ISO-8859-1", you can also try your luck with mb_detect_encoding().
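A minimal sketch of that check, assuming the mbstring extension is available. The candidate list and the sample bytes are illustrative choices. Strict mode (the third argument) matters here, because without it the function may report an encoding in which the string is not actually valid.

```php
<?php
// Bytes as received: "Dublín, " with í stored as the single byte 0xED.
$raw = "Dubl\xEDn, ";

// The candidate list is an assumption. ISO-8859-1 accepts any byte
// sequence, so it is only useful as the fallback after UTF-8.
$candidates = ['UTF-8', 'ISO-8859-1'];

// Third argument true = strict mode: rule out encodings in which the
// string contains invalid byte sequences.
$detected = mb_detect_encoding($raw, $candidates, true);

echo $detected, "\n";
```

Detection of this kind is a heuristic: it can tell you that a string is not valid UTF-8, but it cannot prove which single-byte encoding was really used.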

Pekka
  • mb_detect_encoding() says the content is UTF-8 yet if it were valid UTF-8 would the XML parser complain about it? – Camsoft Mar 24 '10 at 13:06
  • @Camsoft strange. Can you try it with Firefox? Can you boil it down to the character that creates the problem? Are you at liberty to publish the URL to the XML feed? – Pekka Mar 24 '10 at 13:10
2

If you download the XML file and open it in, for example, Notepad++, you'll see that the encoding is set to something other than UTF-8. I've had the same problem with XML I made myself, and it was just the encoding in the editor :)

The string <?xml version="1.0" encoding="UTF-8"?> doesn't change how the document's bytes are actually encoded; it only declares the encoding for the parser, a validator or another consumer.

skr
1

I just had this problem. Turns out the XML file (not the contents) was not encoded in utf-8, but in ISO-8859-1. You can check this on a Mac with file -I xml_filename.

I used Sublime to change the file encoding to UTF-8, and lxml imported it with no issues.

paragbaxi
1

After several tries I found that the htmlentities function works.

$value = htmlentities($value);
George John
1

My problem was solved by what Erik proposed (https://stackoverflow.com/a/4575802/14934277), and it is actually the only way to know whether your data is okay to be printed.

And here is a piece of code that could be useful to anyone out there:

$product_desc = '...'; // placeholder for your raw field value
// Filter $product_desc here: remove tags, strip, do everything you would do before printing XML
try {
    (new SimpleXMLElement('<sth><![CDATA[' . $product_desc . ']]></sth>'))->asXML();
} catch (Exception $exc) {
    $product_desc = ''; // don't print trash
}

Note this part:

<![CDATA[]]>

When you try to create XML out of it, be sure to pass the final text that a browser would see, that is, with your field wrapped in CDATA.

0

When generating mapping files using Doctrine I ran into the same issue. I fixed it by removing all the comments that some fields had in the database.

Tim Lieberman