
I know that Unicode is a huge symbol set. It covers, for example, the Japanese and Chinese writing systems.

I am reading about UTF-8/16/32, but nowhere can I find an explicit statement that absolutely any Unicode symbol can be encoded using UTF-8/16/32.

Is it true that every Unicode encoding has the same expressive power?

If it is true, then what is the reason to use UTF-16/32, given that UTF-8 generally uses memory more sparingly and is ASCII-compatible?

gstackoverflow
  • UTF-8/16/32/Batman are all just different ways of representing sequences of Unicode code points. They're all equally expressive and mutually convertible; they just have different trade-offs in terms of storage and processing. – Kerrek SB Jul 07 '14 at 08:42
  • Possible duplicate of [Do UTF-8, UTF-16, and UTF-32 Unicode encodings differ in the number of characters they can store?](http://stackoverflow.com/questions/130438/do-utf-8-utf-16-and-utf-32-unicode-encodings-differ-in-the-number-of-characters) – Jukka K. Korpela Jul 07 '14 at 09:01

2 Answers


Yes, you can. For all readers: Unicode assigns each character a number (code point) from U+0000 up to U+10FFFF, a range that fits in 3 bytes. UTF-8 is a multi-byte encoding that chains bytes together: the lead byte's high bits signal the sequence length, continuation bytes carry a fixed high-bit marker, and the remaining bits are free to hold the code point's value. UTF-16 also has an escaping mechanism (surrogate pairs) for code points above U+FFFF. And UTF-32 is wide enough to hold any code point in a single unit.

For Asian scripts UTF-8 is not optimal; for Latin script it is. In general that only matters on small devices or in huge databases.
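To illustrate the point (this snippet is my own, not part of the original answer): the same code points round-trip through all three encodings, and only the byte lengths differ. The big-endian variants are used so no BOM is prepended.

```python
# Each character round-trips through UTF-8/16/32; only the size varies.
for ch in ["A", "é", "汉", "😀"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # BE variant: no byte-order mark
    utf32 = ch.encode("utf-32-be")
    # Every encoding decodes back to the identical character
    assert ch == utf8.decode("utf-8") == utf16.decode("utf-16-be") == utf32.decode("utf-32-be")
    print(f"U+{ord(ch):04X}: utf-8={len(utf8)}B utf-16={len(utf16)}B utf-32={len(utf32)}B")
```

Running this shows 'A' takes 1 byte in UTF-8 but 4 in UTF-32, while '汉' takes 3 bytes in UTF-8 and only 2 in UTF-16, which is the trade-off the answer describes.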

Joop Eggen
  • "For Asian scripts UTF-8 is not optimal, for latin script it is optimal." And yet even though UTF-8 is not an optimal encoding for Asian scripts, it can be the best encoding in many situations *involving* Asian scripts. For example, [the Wikipedia page for the Chinese language, in Chinese](https://zh.wikipedia.org/wiki/汉语) is 318,945 bytes in UTF-8. Convert that same HTML document to UTF-16 (which uses only 2 bytes per Chinese character, while UTF-8 uses 3), and its size becomes 542,040 bytes. The amount of Latin script in HTML overwhelms the space savings from efficiently encoding Chinese. – rmunn Sep 22 '15 at 16:13
  • Therefore, "just use UTF-8 unless you have a very good reason not to" is actually a really good rule of thumb. – rmunn Sep 22 '15 at 16:14

All UTF-x encodings can represent all Unicode codepoint sequences.

With UTF-32, each codepoint requires 4 bytes.

With UTF-16, most codepoints use 2 bytes; codepoints outside the Basic Multilingual Plane use 4 bytes via surrogate pairs.

With UTF-8, a codepoint may use between 1 and 4 bytes.

For European character sets, UTF-8 is the most memory-efficient encoding.

Remy Lebeau
laune