Is UTF-16 compatible with UTF-8?

Question

I asked Google the question above and was sent to Difference between UTF-8 and UTF-16? which unfortunately doesn't answer the question.

From my understanding UTF-8 should be a subset of UTF-16 meaning: if my code uses UTF-16 and I hand in a UTF-8 encoded string everything should always be fine. The other way around (expecting UTF-8 and getting UTF-16) may cause problems.

Is that correct?

EDIT: To clarify why the linked SO question doesn't answer my question: My problem arose when trying to process a JSON string using WebClient.DownloadString, because the WebClient used the wrong encoding. The JSON I received from the request was encoded as UTF-8, and the question for me was: if I set webClient.Encoding = New System.Text.UnicodeEncoding (a.k.a. UTF-16), would I be on the safe side, i.e. able to handle both UTF-8 and UTF-16 request results, or should I use webClient.Encoding = New System.Text.UTF8Encoding?

What do you mean by "hand in"? They encode the same set of characters, but a byte sequence in UTF-8 won't represent the same set of characters if it's interpreted as UTF-16. It would really help if you'd give more details about what you're trying to do. — Jon Skeet, Sep 10 '15 at 10:56
possible duplicate of [Difference between UTF-8 and UTF-16?](http://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16) — tripleee, Sep 10 '15 at 11:30
[What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) — deceze, Sep 10 '15 at 11:40
No, that is not correct. Not all UTF-8 encoded bytes are valid UTF-16 bytes, as well as vice versa. There is no way to pick the right encoding that can handle both; you need to know the encoding of your input and treat it accordingly. — jrochkind, Sep 16 '15 at 02:57

tripleee · Accepted Answer · 2022-07-04T06:36:22.830

It's not entirely clear what you mean by "compatible", so let's get some basics out of the way.

Unicode is the underlying concept, and UTF-16 and UTF-8 are two different ways to encode Unicode. They are obviously different -- otherwise, why would there be two different serialization formats?

Unicode by itself does not specify a serialization format. UTF-8 and UTF-16 are two alternative serialization formats. There are several others, but these two are arguably the most widely used.

They are "compatible" in the sense that they can represent the same Unicode code points, but "incompatible" in that the byte representations are completely different and irreconcilable.

There are two additional twists with UTF-16. Firstly, there are actually two different encodings, UTF-16LE and UTF-16BE. These differ in endianness. (UTF-8 is a byte encoding, so it does not have endianness.) Secondly, legacy UTF-16 used to be restricted to 65,536 possible characters, which is fewer than Unicode currently contains. This is handled with surrogates, but really old and/or broken UTF-16 implementations (properly identified as UCS-2, not "real" UTF-16) do not support them.

To make this concrete, let's compare four different code points. We pick U+0041, U+00E5, U+201C, and U+1F4A9, as they illustrate the differences nicely.

U+0041 is a 7-bit character, so UTF-8 represents it simply with a single byte. U+00E5 is an 8-bit character, so UTF-8 needs a two-byte sequence to encode it. U+1F4A9 is outside the Basic Multilingual Plane, so UTF-16 represents it with a surrogate sequence. Finally, U+201C is none of the above: it fits in a single 16-bit unit in UTF-16, but needs three bytes in UTF-8.

Here are the representations of our candidate characters in UTF-8, UTF-16LE, and UTF-16BE.

Character	UTF-8	UTF-16LE	UTF-16BE
U+0041 (A)	0x41	0x41 0x00	0x00 0x41
U+00E5 (å)	0xC3 0xA5	0xE5 0x00	0x00 0xE5
U+201C (“)	0xE2 0x80 0x9C	0x1C 0x20	0x20 0x1C
U+1F4A9 (💩)	0xF0 0x9F 0x92 0xA9	0x3D 0xD8 0xA9 0xDC	0xD8 0x3D 0xDC 0xA9
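The rows in the table can be reproduced with a few lines of Python. This is only a minimal sketch; the helper names `hex_bytes` and `encoding_table` are made up for illustration, while the codec names `utf-8`, `utf-16-le`, and `utf-16-be` are the standard Python ones.

```python
# Encode each sample code point in the three encodings from the table.
CODE_POINTS = [0x0041, 0x00E5, 0x201C, 0x1F4A9]
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be"]


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated 0xNN values, like the table."""
    return " ".join(f"0x{b:02X}" for b in data)


def encoding_table() -> dict:
    """Map each code point to its hex representation in every encoding."""
    return {
        cp: {enc: hex_bytes(chr(cp).encode(enc)) for enc in ENCODINGS}
        for cp in CODE_POINTS
    }


# Print one tab-separated row per code point.
for cp, row in encoding_table().items():
    print(f"U+{cp:04X}", *row.values(), sep="\t")
```

The same string produces three different byte sequences, and only the code point itself is shared between them.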

To pick one obvious example, the UTF-8 encoding of U+00E5 would represent a completely different character if interpreted as UTF-16 (in UTF-16LE, it would be U+A5C3, and in UTF-16BE, U+C3A5). Any UTF-8 sequence with an odd number of bytes ends in an incomplete 16-bit unit when read as UTF-16. I suppose UTF-8 when interpreted as UTF-16 could also happen to encode an invalid surrogate sequence. Conversely, many UTF-16 byte sequences are not valid UTF-8 sequences at all. So in this sense, UTF-8 and UTF-16 are completely and utterly incompatible.
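These failure modes are easy to check in Python. The sketch below uses a hypothetical helper, `try_decode`, that returns `None` when the bytes are not valid in the requested encoding; it relies on Python's default strict error handling.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_a_ring = "\u00e5".encode("utf-8")               # b'\xc3\xa5'

# Valid UTF-8, silently misread as a different character in UTF-16.
wrong_le = try_decode(utf8_a_ring, "utf-16-le")      # one character, U+A5C3
wrong_be = try_decode(utf8_a_ring, "utf-16-be")      # one character, U+C3A5

# Three UTF-8 bytes (U+201C) leave an incomplete 16-bit unit.
odd_length = try_decode("\u201c".encode("utf-8"), "utf-16-le")    # None

# The UTF-8 bytes of U+0600 are D8 80, a lone high surrogate in UTF-16BE.
lone_surrogate = try_decode("\u0600".encode("utf-8"), "utf-16-be")  # None

# UTF-16 bytes read as UTF-8: invalid for U+00E5, misleading for U+0041.
invalid_utf8 = try_decode("\u00e5".encode("utf-16-le"), "utf-8")  # None
stray_nul = try_decode("A".encode("utf-16-le"), "utf-8")          # 'A\x00'
```

Note that the worst cases are the ones that do not raise an error: the decoder returns text, just not the text that was sent.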

The values in the table are byte values, and a byte means nothing without an encoding: in ASCII, 0x00 is the NUL character (sometimes represented as ^@), 0x41 is uppercase A, and 0xE5 is undefined; in e.g. Latin-1, 0xE5 represents the character å (which is also conveniently U+00E5 in Unicode), but in KOI8-R it is the Cyrillic character Е (U+0415), etc.

Notice also how the last example in the table, U+1F4A9, requires a nontrivial transformation in UTF-16, too: it is split into a pair of surrogate code points, which is superficially similar to how UTF-8 encodes all multibyte code points.
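The surrogate transformation itself is short arithmetic. Here is a minimal sketch; the function name `to_surrogate_pair` is made up for illustration, and the constants follow the standard UTF-16 scheme (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04X} {low:04X}")  # D83D DCA9
```

Written out as big-endian bytes, the pair D83D DCA9 is exactly the UTF-16BE entry for U+1F4A9 in the table.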

In modern programming languages, your code should simply use Unicode, and let the language handle the nitty-gritty of encoding it in a way which is suitable for your platform and libraries. On a somewhat tangential note, see also http://utf8everywhere.org/
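In practice this means decoding bytes exactly once, at the I/O boundary, with the encoding the sender actually used, and working with plain Unicode strings after that. The sketch below is a Python illustration of that idea, not the .NET WebClient API from the question; `parse_json_bytes` is a hypothetical helper, and the assumption is that the encoding is known from outside, for example from the response headers.

```python
import json


def parse_json_bytes(raw: bytes, encoding: str = "utf-8"):
    """Decode raw bytes with a known encoding, then parse the text as JSON."""
    text = raw.decode(encoding)   # bytes -> Unicode string, exactly once
    return json.loads(text)       # from here on, only Unicode text is handled


# Bytes as they might arrive from a server that sends UTF-8.
raw = '{"name": "\u00e5"}'.encode("utf-8")

data = parse_json_bytes(raw, "utf-8")   # {'name': 'å'}
```

Passing `"utf-16-le"` for the same bytes does not recover the data: the decode step yields unrelated characters, and the JSON parser then rejects them. There is no single setting that handles both encodings.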

Looking at the question you linked to, the answers there basically tell you this. I will nominate to close your question as a duplicate. — tripleee, Sep 10 '15 at 11:29
I disagree very much: it is very clear what's being asked 'is UTF-8 a subset of UTF-16?' to which the answer is clearly 'no'. — rasmus91, Sep 26 '19 at 04:29
I strongly disagree with "simply use unicode", because data sharing might require reencoding and not being aware of underlying encoding of a dataset might break the application in very annoying and subtle ways. — Jay-Pi, May 17 '22 at 10:30
Well, yes, there are situations where you need to understand encodings in order to work with them, but if all you want to do is handle text, many modern languages will abstract away the details of the internal representation and let you focus on the task at hand. — tripleee, May 17 '22 at 10:54
@tripleee sorry, i don't know why it took me 7 years to officially accept your answer... so, FWIW, thank again ;o) — mike, Jul 03 '22 at 19:53
