NSString unicode encoding problem

Question

I'm having trouble converting a string into something readable. I'm using

NSString *substring = [NSString stringWithUTF8String:[symbol.data cStringUsingEncoding:NSUTF8StringEncoding]];

but I can't convert \U7ab6\U51b1 into '

It shows as 窶冱, which is not what I want; it should show as an '. Can anyone help me?

? The character U+7AB6 is 窶 and U+51B1 is definitely 冱. How would that sequence ever represent an apostrophe? — bobince, Mar 27 '11 at 11:33
Hi bobince, it is not an apostrophe, but it looks like one. I pasted it here from a Word document: the first is a plain apostrophe ', while the one I mean is shown as ’ and is produced by the combination \U7ab6\U51b1. I just want it shown as ’ — munchine, Mar 27 '11 at 23:18

score 3 · Accepted Answer · answered Mar 27 '11 at 23:41

it is shown as a ’

That's character U+2019 RIGHT SINGLE QUOTATION MARK.

What has happened is that the character sequence ’s was submitted to you in the UTF-8 encoding, which comes out as the bytes:

’          s
E2 80 99   73

That byte sequence has then, incorrectly, been interpreted as if it were encoded in Windows code page 932 (Japanese; more or less Shift-JIS):

E2 80    99 73
窶        冱
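
The mix-up in the two diagrams above can be reproduced in a few lines. This is a minimal sketch in Python rather than Objective-C, chosen only because it makes the bytes easy to inspect. The variable names are illustrative, and `cp932` is Python's codec name for Windows code page 932.

```python
# Illustrative sketch: reproduce the mojibake by decoding UTF-8 bytes as cp932.
original = "\u2019s"                     # RIGHT SINGLE QUOTATION MARK followed by "s"
utf8_bytes = original.encode("utf-8")    # the bytes that were actually submitted
mangled = utf8_bytes.decode("cp932")     # the incorrect decoding step

# Show the raw bytes and the code points that the wrong decoding produces.
print(" ".join(f"{b:02X}" for b in utf8_bytes))
print(" ".join(f"U+{ord(c):04X}" for c in mangled))
```

The second line of output lists the same two code points the question reports, U+7AB6 and U+51B1.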

So in this one particular case, you could recover the ’s string by firstly encoding the characters into cp932 bytes, and then decoding those bytes back to characters using UTF-8.
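
As a rough sketch of that one-off recovery, again in Python for illustration: the helper name `undo_cp932_mojibake` is hypothetical, and the approach only works when the mangled text still maps back to the original bytes.

```python
def undo_cp932_mojibake(mangled: str) -> str:
    """Reverse a UTF-8-read-as-cp932 mix-up (hypothetical helper, not a library call)."""
    raw = mangled.encode("cp932")    # back to the bytes that were originally received
    return raw.decode("utf-8")       # decode them the way they should have been decoded


# The two characters from the question, written as escapes.
recovered = undo_cp932_mojibake("\u7ab6\u51b1")
print(recovered == "\u2019s")
```

In Foundation the same two steps would presumably map onto `dataUsingEncoding:` with a Shift-JIS encoding constant, followed by `initWithData:encoding:` with `NSUTF8StringEncoding`. That mapping is an assumption and has not been verified here.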

However, this will not solve your real problem, which is that the strings were read in incorrectly in the first place. You got 窶冱 in this case because the UTF-8 byte sequence resulting from encoding ’s happened also to be a valid Shift-JIS byte sequence. But that won't be the case for all possible UTF-8 byte sequences you might get. Many other characters will be unrecoverably mangled.
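
To illustrate why the round trip cannot be relied on, the sketch below (Python, with a hypothetical helper name) checks whether a string survives being encoded as UTF-8 and misread as cp932. The same apostrophe followed by a space yields the bytes 99 20, and 0x20 is not a valid second byte of a Shift-JIS double-byte character, so the decoding breaks down.

```python
def survives_cp932_mix_up(text: str) -> bool:
    """Return True if text is recoverable after its UTF-8 bytes are read as cp932."""
    utf8_bytes = text.encode("utf-8")
    try:
        mangled = utf8_bytes.decode("cp932")          # the incorrect decoding
        return mangled.encode("cp932") == utf8_bytes  # can the bytes be rebuilt?
    except UnicodeError:
        # The bytes are not a valid cp932 sequence, so information is lost
        # (or an error is raised) at the point of the wrong decoding.
        return False


print(survives_cp932_mix_up("\u2019s"))          # the case from the question
print(survives_cp932_mix_up("dogs\u2019 toys"))  # same apostrophe, followed by a space
```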

You need to find where bytes are being read into the system and decoded as Shift-JIS, and fix that to use UTF-8 instead.
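
What the fix looks like depends on where the bytes enter the system, which the question does not show. As a generic sketch (Python, with a hypothetical function name), the idea is to name UTF-8 explicitly at the point where bytes become text, and to fail loudly rather than guess:

```python
def decode_incoming(raw: bytes) -> str:
    """Hypothetical entry point: turn received bytes into text, explicitly as UTF-8."""
    # Strict error handling raises UnicodeDecodeError on bytes that are not
    # valid UTF-8, which surfaces a wrong-encoding assumption instead of hiding it.
    return raw.decode("utf-8", errors="strict")


text = decode_incoming(bytes.fromhex("e2809973"))
print(text == "\u2019s")
```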
