Handling unicode in the http response xml

Question

I'm writing a Google Chrome extension that builds upon the myanimelist.net REST API. Sometimes the XMLHttpRequest response text contains non-ASCII Unicode characters.

For example:

<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>

If I create a HTML node from the text it looks like this:

Onegai My Melody Sukkiriâ�ª

The actual title, however, is this:

Onegai My Melody Sukkiri♪

Why is my text not correctly rendered and how can I fix it?

Update

Code: background.html

I think these are the crucial parts:

function htmlDecode(input){
  var e = document.createElement('div');
  e.innerHTML = input;
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

function xmlDecode(input){
  var result = input;
  result = result.replace(/</g,  "&lt;");
  result = result.replace(/>/g,  "&gt;");
  result = result.replace(/\n/g, "&#10;");
  return htmlDecode(result);
}

Further:

var parser = new DOMParser();
var xmlText = response.value;
var doc = parser.parseFromString(xmlDecode(xmlText), "text/xml");

Are you getting this response text back from a PHP web service? — Nightfirecat, Aug 15 '11 at 20:05
@Nightfirecat I think so because the website is created in PHP. — StackedCrooked, Aug 15 '11 at 20:15

score 2 · Accepted Answer · answered Aug 16 '11 at 21:02

<title>Onegai My Melody Sukkiri&acirc;�&ordf;</title>

Oh dear! Not only is that the wrong text, it's not even well-formed XML. acirc and ordf are HTML entities which are not predefined in XML, and then there's an invalid UTF-8 sequence (one high byte, presumably originally 0x99) between them.

The problem is that myanimelist are generating their output ‘XML’ (but “if it ain't well-formed, it ain't XML”) using the PHP function htmlentities(). This tries to HTML-escape not only the potentially-sensitive-in-HTML characters <&"', but also all non-ASCII characters.

This generates the wrong characters because PHP defaults to treating the input to htmlentities() as ISO-8859-1 instead of UTF-8 which is the encoding they're actually using. But it was the wrong thing to begin with because the HTML entity set doesn't exist in XML. What they really wanted to use was htmlspecialchars(), which leaves the non-ASCII characters alone, only escaping the really sensitive ones. Because those are the same ones that are sensitive in XML, htmlspecialchars() works just as well for XML as HTML.
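To make the mangling concrete, here is a minimal JavaScript sketch that reproduces it for the ♪ character. It assumes the server behaves as described above (UTF-8 bytes read as ISO-8859-1, then run through a named-entity pass); the `namedEntities` table is a hypothetical two-entry stand-in for the full HTML entity set, and `TextEncoder` is assumed to be available (modern browsers and Node).

```javascript
// UTF-8 bytes of U+266A (the eighth note): 0xE2 0x99 0xAA.
const utf8Bytes = Array.from(new TextEncoder().encode("\u266A"));

// ISO-8859-1 maps every byte to the code point with the same number,
// so reading the bytes as ISO-8859-1 yields U+00E2, U+0099, U+00AA.
const asLatin1 = utf8Bytes.map((b) => String.fromCharCode(b)).join("");

// Hypothetical stand-in for the entity table: only the two names involved here.
const namedEntities = { "\u00E2": "&acirc;", "\u00AA": "&ordf;" };

// Mimic the entity pass: use a named entity if one exists, else keep the character.
const mangled = Array.from(asLatin1, (ch) => namedEntities[ch] || ch).join("");

console.log(utf8Bytes.map((b) => b.toString(16))); // hex values e2, 99, aa
console.log(mangled === "&acirc;\u0099&ordf;");    // true
```

The middle character (0x99) has no named entity, so it goes out as a raw byte, which is exactly the byte that is not valid UTF-8 on its own.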

htmlentities() is almost always the Wrong Thing; htmlspecialchars() should typically be used instead. The one place you might want to encode non-ASCII characters to entity references would be when you're targeting pure ASCII output. But even then htmlentities() fails because it doesn't make character references (&#...;) for the characters that don't have a predefined entity name. Pretty useless.
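For illustration only, here is roughly what the two sane escaping strategies look like, sketched in JavaScript rather than PHP. The function names `escapeXml` and `escapeXmlAscii` are made up for this example, and this is not a port of htmlspecialchars() (for instance, it always escapes the single quote, which PHP only does with certain flags).

```javascript
// Escape only the characters that are markup-sensitive in both XML and HTML.
// Non-ASCII characters are left alone and travel as ordinary UTF-8.
function escapeXml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return text.replace(/[&<>"']/g, (ch) => replacements[ch]);
}

// ASCII-only variant: additionally turn every non-ASCII code point into a
// numeric character reference, which is valid in XML as well as HTML.
function escapeXmlAscii(text) {
  return escapeXml(text).replace(
    /[^\x00-\x7F]/gu,
    (ch) => `&#${ch.codePointAt(0)};`
  );
}

console.log(escapeXml("Sukkiri\u266A <TV> & more"));
console.log(escapeXmlAscii("Onegai My Melody Sukkiri\u266A")); // Onegai My Melody Sukkiri&#9834;
```

Either output parses as well-formed XML; the second is what you would want if the channel really could carry nothing but ASCII.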

Anyway, you can't really recover the mangled data from this. The � represents a byte sequence that was UTF-8-undecodable to the XMLHttpRequest, so that information is irretrievably lost. You will have to persuade myanimelist to fix their broken XML output as per the above couple of paragraphs before you can go any further.
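A small sketch of why the information is gone, assuming the stray byte really was 0x99 and that the response was decoded as UTF-8 in the usual non-fatal way. `bytesWithStray` is a hypothetical helper for this example; `TextDecoder` is assumed to be available (modern browsers and Node).

```javascript
// Build the byte sequence the server presumably sent: ASCII entity text
// with a single stray high byte in the middle.
function bytesWithStray(strayByte) {
  const ascii = (s) => Array.from(s, (ch) => ch.charCodeAt(0));
  return new Uint8Array([...ascii("&acirc;"), strayByte, ...ascii("&ordf;")]);
}

// Non-fatal UTF-8 decoding replaces each undecodable byte with U+FFFD.
const decoder = new TextDecoder("utf-8");
const from99 = decoder.decode(bytesWithStray(0x99));
const from98 = decoder.decode(bytesWithStray(0x98));

console.log(from99 === "&acirc;\uFFFD&ordf;"); // true
console.log(from99 === from98);                // true: different bytes, same text
```

Two different original bytes produce the identical decoded string, so nothing in the response text tells you which one was there.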

Also, they should be returning it as Content-Type: text/xml, not text/html as they do at the moment. Then you could pick up the responseXML directly from the XMLHttpRequest object instead of messing about with DOMParsers.

score 1 · Answer 2 · answered Aug 15 '11 at 20:29

So, I've come across something similar to what's going on here at work, and I did a bit more research to confirm my hypothesis.

If you take a look at the returned value you posted above, you'll notice the telltale "â". 99% of the time when you see this character, it means you have a character encoding issue (typically UTF-8 bytes are being interpreted as ISO-8859-1).

The first thing I would test for is to force a character encoding in the API return. (It's a long shot, but you could look)

Second, I'd try to force a character encoding onto the data returned (I know there's a .htaccess override, but I don't know what's allowed in Chrome extensions so you'll have to research that).

What I believe is going on is that when you create the node with the data, you don't have a character encoding set on the document, and browsers (typically, in my experience) default to ISO-8859-1. So, check to make sure it's not your document that's the problem.

Finally, if you can't find (or can't prevent) the source of the character encoding problem, you'll have to write a conversion table to replace the malformed values you're getting with the ones you want. JS's "replace" should be fine (http://www.w3schools.com/jsref/jsref_replace.asp).
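A minimal sketch of such a conversion table, with loud caveats: the table contents and the names `conversionTable` and `repairKnownSequences` are hypothetical, and the single entry is a guess, because the byte behind the U+FFFD replacement character is no longer in the text and other characters could have produced the same mangled sequence.

```javascript
// Hypothetical conversion table: mangled sequence -> intended character.
// The entry below is a guess; the middle byte was replaced by U+FFFD,
// so other characters would be mangled to exactly the same sequence.
const conversionTable = new Map([
  ["&acirc;\uFFFD&ordf;", "\u266A"], // assumed to be the eighth note
]);

// Replace every known mangled sequence; anything not in the table is untouched.
function repairKnownSequences(text) {
  let result = text;
  for (const [mangled, intended] of conversionTable) {
    // split/join replaces all occurrences without regex escaping concerns.
    result = result.split(mangled).join(intended);
  }
  return result;
}

const fixed = repairKnownSequences(
  "<title>Onegai My Melody Sukkiri&acirc;\uFFFD&ordf;</title>"
);
console.log(fixed === "<title>Onegai My Melody Sukkiri\u266A</title>"); // true
```

This only helps for sequences you have seen and checked by hand; it is a workaround, not a real decoding step.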

score -1 · Answer 3 · edited May 23 '17 at 10:34


You can't just use a simple search and replace to fix an encoding issue, since the characters are Unicode, not characters typed on a keyboard.

Your data must be stored on the server in UTF-8 format if you are planning on retrieving it via AJAX. This problem is probably due to someone pasting in characters from MS-Word which use a completely different encoding scheme (ISO-8859).

If you can't fix the data, you're kinda screwed.

For more details, see: UTF-8 vs. Unicode

answered Aug 15 '11 at 20:23 · Diodeus - James MacFarlane

Ok, let's call them "characters typed on a keyboard" since people don't type in two-byte codes. – Diodeus - James MacFarlane Aug 15 '11 at 20:29

What do you mean “two-byte codes”? First of all, Unicode code points range from 0–0x10FFFF. How is that a “two-byte” code? Secondly, I can type a lot on my keyboard: ¡™£¢∞§¶•ªº–≠⁄€‹›ﬁﬂ‡°·‚—œ∑´®†¥¨ˆøπ“Œ„´‰ˇÁ¨ˆØ∏”åß∂ƒ©˙∆˚¬…ÅÍÎÏ˝ÓÔÒÚΩ≈ç√∫˜µ≤≥÷¸˛Ç◊ı˜Â¯˘¿p. Unicode doesn’t have bytes, it has abstract code points. Perhaps you’re thinking of some encoding form. Again, I can type lots of things on my keyboard just fine — ¿can’t you? I can even type arbitrary code points and compose extended grapheme clusters with it. Sounds to me like you need a new keyboard. I suggest an Apple. – tchrist Aug 15 '11 at 22:25
