0

I've read that Windows CE uses the "UTF-16 version of UNICODE" (I'm a newbie with encodings).

What happens when a string contains a character that requires more than 2 bytes, like Chinese characters? Does it take 3? If I have a string containing Chinese characters, does that mean accessing the N-th pair of bytes will not necessarily access the N-th visible symbol?

Also, what about performance? If I understand correctly, encodings that use a variable number of bytes per visible symbol require the string to be scanned from the beginning to access the N-th visible symbol, right? If so, is that also true for UTF-16?

Thank you.

Virus721
  • See 1) [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html), 2) [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/), and 3) [UTF-16](https://en.wikipedia.org/wiki/UTF-16). – Remy Lebeau Mar 02 '15 at 02:36

1 Answer

1

What happens when a string contains a character that requires more than 2 bytes, like Chinese characters? Does it take 3?

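No, four. UTF-16 never uses 3 bytes: a character takes either one 16-bit code unit (2 bytes) or a surrogate pair of two code units (4 bytes). Note that most commonly used Chinese characters are in the Basic Multilingual Plane and therefore fit in a single code unit; only characters outside it (for example, rarer CJK extension characters) need four bytes.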

Wikipedia: UTF-16:

In UTF-16, code points greater or equal to 2^16 are encoded using two 16-bit code units.
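As a rough illustration of the quote above, here is a minimal Python sketch. The helper names `utf16_code_units` and `to_surrogate_pair` are made up for this example; the arithmetic is the standard surrogate-pair calculation, and U+20000 is just one sample character outside the Basic Multilingual Plane.

```python
from typing import Tuple


def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for the given text."""
    # "utf-16-le" encodes without a byte order mark; each code unit is 2 bytes.
    return len(ch.encode("utf-16-le")) // 2


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point >= 0x10000 into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


# A common Chinese character (U+4E2D) fits in one code unit (2 bytes).
common = utf16_code_units("\u4e2d")
# A CJK Extension B character (U+20000) needs two code units (4 bytes).
rare = utf16_code_units("\U00020000")
pair = to_surrogate_pair(0x20000)
```

So the byte count per character is always 2 or 4 in UTF-16, never 3.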


If I understand well, encodings that have a variable number of bytes per visible symbol require the string to be scanned from the beginning to access the N-th visible symbol right?

Yes, and that is also true for UTF-16, because a character may occupy either one or two code units. See for example "Why use multibyte string functions in PHP?".
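To make the scanning cost concrete, here is a small Python sketch that treats a string as a list of 16-bit code units, roughly how a `WCHAR` buffer would look on Windows CE. The function names `code_units` and `nth_code_point` are hypothetical, and the sketch counts code points only; a "visible symbol" can also be built from several code points (for example with combining marks), which this does not handle.

```python
import struct
from typing import List


def code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def nth_code_point(units: List[int], n: int) -> int:
    """Return the n-th (0-based) code point by scanning from the start."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogates into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if count == n:
            return cp
        count += 1
        i += width
    raise IndexError("code point index out of range")


units = code_units("a\U00020000b")
naive = units[2]                    # direct indexing lands on a low surrogate
scanned = nth_code_point(units, 2)  # scanning finds 'b'
```

Direct indexing is O(1) but only gives the N-th code unit; finding the N-th code point needs a linear scan like the one above whenever surrogate pairs may be present.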

CodeCaster