Flash CS4/AS3: differing behavior between console and textarea for printing UTF-16 characters

Question

trace(escape("д"));

will print "%D0%B4", the correct URL encoding for this character (the Cyrillic equivalent of "D").

However, if I do this:

myTextArea.htmlText += unescape("%D0%B4");

What gets printed is:

Ð´

which is of course incorrect. Simply tracing the above unescape returns the correct Cyrillic character, though! For this textarea, escaping "д" returns its Unicode code point, "%u0434".

I'm not sure what exactly is happening to mess this up, but...

UTF-16 Ð´ in web encoding is: %FE%FF%00%D0%00%B4

Whereas

UTF-16 д in web encoding is: %00%D0%00%B4

So it's padding this value with something at the beginning. Why would a trace provide different text than a print to an (empty) textarea? What's goin' on?

The textarea in question has no weird encoding properties attached to it, if that sort of thing is even possible.

score 4 · Accepted Answer · answered Mar 31 '11 at 02:42

The problem is unescape (escape could also be a problem, but it's not the culprit in this case). These functions are not multibyte aware. What escape does is this: it takes a byte in the input string and returns its hex representation with a % prepended. unescape does the opposite. The key point here is that they work with bytes, not characters.

What you want is encodeURIComponent / decodeURIComponent. Both use utf-8 as the string encoding scheme (the encoding used by Flash everywhere). Note that it's not utf-16 (which you shouldn't care about as far as Flash is concerned).

encodeURIComponent("д"); //%D0%B4
decodeURIComponent("%D0%B4"); // д

Now, if you want to dig a bit deeper, here's what's going on (this assumes a basic knowledge of how utf-8 works).

escape("д")

This returns

%D0%B4

Why?

"д" is treated by flash as utf-8. The codepoint for this character is 0x0434.

In binary:

0000 0100 0011 0100

It fits in two utf-8 bytes, so it's encoded thus (where e means encoding bit, and p means payload bit):

1101 0000 1011 0100
eeep pppp eepp pppp

Converting it to hex, we get:

0xd0  0xb4

So, 0xd0,0xb4 is a utf-8 encoded "д".
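
The bit shuffling above can be sketched in a few lines. This is TypeScript rather than ActionScript, chosen only because it is easy to run, and encodeTwoByteUtf8 is a made-up helper name for illustration, not a Flash API. It assumes the code point fits the two-byte form (U+0080 to U+07FF).

```typescript
// Hypothetical helper: encode a code point in U+0080..U+07FF as two UTF-8 bytes.
function encodeTwoByteUtf8(codePoint: number): number[] {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError("code point does not fit the two-byte UTF-8 form");
  }
  // Lead byte: encoding bits 110, followed by the top 5 payload bits.
  const lead = 0xc0 | (codePoint >> 6);
  // Continuation byte: encoding bits 10, followed by the low 6 payload bits.
  const continuation = 0x80 | (codePoint & 0x3f);
  return [lead, continuation];
}

const utf8Bytes = encodeTwoByteUtf8(0x0434); // U+0434 is "д"
const utf8Hex = utf8Bytes.map((b) => "0x" + b.toString(16)).join(" ");
console.log(utf8Hex); // 0xd0 0xb4
```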

This is fed to escape. escape sees two bytes, and gives you:

%d0%b4

Now, you pass this to unescape. But unescape is a little bit brain-dead, so it thinks one byte is one and the same thing as one char, always. As far as unescape is concerned, you have two bytes, hence, you have two chars. If you look up the code-points for 0xd0 and 0xb4, you'll see this:

0xd0 -> Ð
0xb4 -> ´

So, unescape returns a string consisting of two chars, Ð and ´ (instead of figuring out that the two bytes it got were actually just one char, utf-8 encoded). Then, when you assign the text property, you are not really passing "д" but "Ð´", and this is what you see in the text area.
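
To make the difference concrete, here is a small sketch of the two interpretations side by side. It is TypeScript, not ActionScript, and percentToBytes and decodeBytePerChar are hypothetical helpers that only imitate the byte-per-char behavior described above; they are not how Flash implements unescape. decodeURIComponent is the standard function and treats the bytes as utf-8.

```typescript
// Hypothetical helper: turn "%D0%B4" into raw byte values [0xd0, 0xb4].
// Assumes the input consists only of %XX triplets.
function percentToBytes(escaped: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < escaped.length; i += 3) {
    bytes.push(parseInt(escaped.substring(i + 1, i + 3), 16));
  }
  return bytes;
}

// Hypothetical helper: the byte-per-char view, where each byte becomes one char.
function decodeBytePerChar(bytes: number[]): string {
  return bytes.map((b) => String.fromCharCode(b)).join("");
}

const bytePerChar = decodeBytePerChar(percentToBytes("%D0%B4")); // "Ð´"
const utf8Aware = decodeURIComponent("%D0%B4"); // "д"

// Two chars in the first case, one char in the second.
console.log(bytePerChar.length, utf8Aware.length); // 2 1
```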

While this answers the above question, I'm finding that encodeURIComponent is the issue rather than escape. Escaping the Cyrillic character gives me "%u0434", which is printed correctly. encodeURIComponent gives me that %d0%b4, which is interpreted as two characters like you said. What can be done if encodeURIComponent's output is interpreted incorrectly like this? — Alkanshel, Mar 31 '11 at 17:33
@Amalgovinus. I'm not sure how this situation should be dealt with (also I don't know what program / server / etc is interpreting these characters). I only have had encoding related problems with Spanish characters. All Spanish characters have code-points below 256, so they are either not encoded but sent as a single byte (ISO-8859-1) or they are encoded to utf-8 ("ñ" for instance is %f1 or %c3%b1). I'm not sure how you are supposed to send the data if the receiving program doesn't understand utf-8 or iso-8859-1. Have you tried utf-16? If it works maybe you'll have to do the encoding yourself. — Juan Pablo Califano, Mar 31 '11 at 18:37
Turns out the ultimate culprit was how we were retrieving the data remotely with a URLLoader, then using URLDecode on it. Apparently URLDecode can't handle UTF-16 and interprets it as UTF-8. The solution seems to be using decodeURI instead (and changing a few other things). Thank you for your time. Also I meant to edit the question and not your answer, sorry bout that — Alkanshel, Mar 31 '11 at 21:33
errr... not 'URLDecode', unescape. XD Yeah, unescape is bunk. Came full circle. — Alkanshel, Mar 31 '11 at 22:10
@Amalgovinus. No problem. I'm glad you found a solution. Since you are using a URLLoader, in the worst possible case you could load the data as binary and do all the necessary decoding yourself (but it seems you don't need to do that since you've managed to get it working) — Juan Pablo Califano, Apr 01 '11 at 01:28
