Need help understanding UTF encodings

Question

Hello, I have noticed that when I save a text file using UTF-8 encoding (no BOM), I am able to read it perfectly using the UTF-16 encoding in C#. This got me a little confused, because UTF-8 only uses 8 bits, right? And UTF-16 takes, well, 16 bits for each character.

Now imagine that I have the string "ab" written in this file as UTF-8, then there is one byte there for the letter "a" & another one for the "b".

OK, but how is it possible to read this UTF-8 file using the UTF-16 charset? The way I see it, while reading the file, the two bytes of "ab" would be mistaken for a single character containing both bytes, because UTF-16 needs those 2 bytes.

This is how I read it (t.txt is encoded as UTF-8):

using(StreamReader sr = new StreamReader(File.OpenRead("t.txt"), Encoding.GetEncoding("utf-16")))
{
    Console.Write(sr.ReadToEnd());
    Console.ReadKey();
}
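
The concern in the question can be checked without a file. This is only a minimal sketch: it assumes the file holds exactly the two bytes of "ab" and no BOM, uses an in-memory array as a stand-in for t.txt, and is written with top-level statements (C# 9 or later). If the real file behaves differently, it may be worth checking whether it actually starts with a BOM, since StreamReader can switch encodings when it detects one.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for t.txt: "ab" saved as UTF-8 without a BOM.
byte[] fileBytes = Encoding.UTF8.GetBytes("ab");         // 0x61 0x62

// Decode the same two bytes once as UTF-8 and once as UTF-16 (little-endian).
string asUtf8 = Encoding.UTF8.GetString(fileBytes);
string asUtf16 = Encoding.Unicode.GetString(fileBytes);

Console.WriteLine(fileBytes.Length);                     // 2
Console.WriteLine(asUtf8);                               // ab
Console.WriteLine(asUtf16.Length);                       // 1
Console.WriteLine($"U+{(int)asUtf16[0]:X4}");            // U+6261
```

So with no BOM and a plain UTF-16 decode, the two bytes really do collapse into a single, unrelated character.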

UTF-8 uses 8 bits when you are dealing with English -- but if you are dealing with other languages UTF-8 could be 16, 24, or even more bits. — Sai, Jun 11 '11 at 04:27
@Sai, oh, I thought that utf-8 would always be 8 bits long and when using 16 bits it would then be called utf-16. So I could have 16 bits and still be using utf-8 and not utf-16? — Delta, Jun 11 '11 at 15:48
@tchrist ok, but if utf-16 needs AT LEAST 2 bytes, and a file encoded as utf-8 could have characters using only 1 byte, how come it works anyway when decoding as utf-16? Does it simply add a 0x00 byte under the hood when it knows the character only uses 1 byte? But if it does, then there would be no difference from utf-8. I'm not understanding. — Delta, Jun 11 '11 at 15:53
@Delta with UTF-8 different characters could have different lengths. For instance regular english characters will take 8 bits but other character sets like for example Tamil will take more bits. You can take a look at http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html -- which has a great explanation. — Sai, Jun 11 '11 at 17:30
@Sai: With UTF-16 different characters could have different lengths. Only UTF-32 has fixed width, not UTF-16 and not UTF-8. How come people get this confused? — tchrist, Jun 12 '11 at 01:51
@tchrist I guess if it was called "UTF-8orMore" it would be more obvious. But that would be kind of silly... — Tyler, Jun 14 '11 at 17:32
@Sai It’s *emphatically ɴᴏᴛ ᴛʀᴜᴇ* that “UTF‑8 uses 8 bits when you are dealing with English” as you have here alleged. This is a distressingly commonly‐held misunderstanding — and it’s far worse than simply being not true: ***it’s actually harmful!*** That’s my quick 2¢ for now… but within the next 2–6 (call it 4±2) weeks I anticipate writing an essay‐length ꜰᴍᴛᴇʏᴇᴡᴛᴋ™ about this. If my history with these is any guide, I have every reason to predict this new ꜰᴍᴛᴇʏᴇᴡᴛᴋ *shall* become **the definitive treatise** about all this for decades to come. ***Tʀᴜsᴛ Mᴇ: history suggests you should!*** — tchrist, Jun 14 '11 at 21:26

score 5 · Accepted Answer · answered Jun 11 '11 at 04:25

Check out http://www.joelonsoftware.com/articles/Unicode.html, it will answer all your unicode questions

Andrew dh

Good article, he said that utf-8 can store any code point while other encodings like iso-8859-1, windows-1252 etc just some. Now I wonder why doesn't everybody just use utf-8. – Delta Jun 11 '11 at 05:11
Most of the newer browsers recommend using UTF-8 for use on webpages. As for other applications, I suppose the hindrance in its adoption is the variable length of each character. Unlike fixed length encodings, one cannot simply get to the nth character by using `offset = n * encodingLength`. – Devendra D. Chavan Jun 11 '11 at 05:30
@Devendra: Then you had best use UTF‑32, because UTF‑16 doesn’t have that property. Anybody who thinks they can use simple indexing into UTF‑16 to get to the nᵗʰ character is in grave error, as demonstrated by this comment. And there are many very good reasons to use UTF-8 for web pages. Some are in [this answers](http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129)’s section. – tchrist Jun 11 '11 at 15:17

Devendra D. Chavan · Answer 2 · 2011-06-12T01:42:58.327

The '8' means that UTF-8 uses 8-bit blocks (code units) to represent a character. It does not mean that each character takes a fixed 8 bits. The number of blocks per character varies from 1 to 4 (the original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4).

Try this simple test:

1. Create a text file (in, say, Notepad++) with the "UTF-8 without BOM" encoding.
2. Read the text file as raw bytes with File.ReadAllBytes(): `byte[] utf8 = File.ReadAllBytes(@"E:\SavedUTF8.txt");`
3. Check the number of bytes taken by each character.
4. Now try the same with a file encoded as ANSI: `byte[] ansi = File.ReadAllBytes(@"E:\SavedANSI.txt");`
5. Compare the bytes per character for both encodings.

Note: File.ReadAllBytes() returns the file's raw bytes without decoding them. It is File.ReadAllText() that attempts to automatically detect the encoding of a file based on the presence of byte order marks; UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

Interesting results
SavedUTF8.txt contains character

a : Number of bytes in the byte array = 1
© (U+00A9, UTF-8 bytes C2 A9)(Alt+0169) : Number of bytes in the byte array = 2
€ (U+20AC, UTF-8 bytes E2 82 AC)(Alt+0128) : Number of bytes in the byte array = 3

A single-byte ANSI code page (such as Windows-1252) always takes 8 bits per character, i.e. for the sample above the byte array will always be of size 1, irrespective of the character in the file. As pointed out by @tchrist, UTF-16 takes 2 or 4 bytes per character (and not a fixed 2 bytes per character).
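
The byte counts above can also be reproduced in memory, without saving any files. This is a minimal sketch using top-level statements (C# 9 or later); the fourth sample, U+1D11E, is my own addition to show a character that needs 4 bytes in both UTF-8 and UTF-16. The ANSI case is left out because code-page encodings may need extra setup on newer .NET versions.

```csharp
using System;
using System.Text;

// a, © (U+00A9), € (U+20AC), and U+1D11E (a character outside the BMP, added for illustration).
string[] samples = { "a", "\u00A9", "\u20AC", "\U0001D11E" };

foreach (string s in samples)
{
    int utf8Bytes = Encoding.UTF8.GetByteCount(s);      // 1, 2, 3, 4
    int utf16Bytes = Encoding.Unicode.GetByteCount(s);  // 2, 2, 2, 4
    int codePoint = char.ConvertToUtf32(s, 0);
    Console.WriteLine($"U+{codePoint:X4}: UTF-8 = {utf8Bytes} byte(s), UTF-16 = {utf16Bytes} byte(s)");
}
```

The last sample is stored as a surrogate pair, which is why UTF-16 is variable width too.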

Encoding table (from here)
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:

U-00000000 – U-0000007F:    0xxxxxxx
U-00000080 – U-000007FF:    110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF:    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF:    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF:    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF:    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
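
To make the table concrete, here is a rough sketch of how the x positions get filled. `EncodeUtf8` is a hypothetical helper written for this illustration, not a framework API. It only covers the first four rows of the table and assumes the input is a valid code point (no surrogates, nothing above U+10FFFF). In real code you would just use Encoding.UTF8.

```csharp
using System;
using System.Text;

// Hypothetical helper: encodes one code point by filling the x bits of the table rows.
static byte[] EncodeUtf8(int cp)
{
    if (cp < 0x80)                                   // 0xxxxxxx
        return new[] { (byte)cp };
    if (cp < 0x800)                                  // 110xxxxx 10xxxxxx
        return new[] { (byte)(0xC0 | (cp >> 6)),
                       (byte)(0x80 | (cp & 0x3F)) };
    if (cp < 0x10000)                                // 1110xxxx 10xxxxxx 10xxxxxx
        return new[] { (byte)(0xE0 | (cp >> 12)),
                       (byte)(0x80 | ((cp >> 6) & 0x3F)),
                       (byte)(0x80 | (cp & 0x3F)) };
    return new[] { (byte)(0xF0 | (cp >> 18)),        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                   (byte)(0x80 | ((cp >> 12) & 0x3F)),
                   (byte)(0x80 | ((cp >> 6) & 0x3F)),
                   (byte)(0x80 | (cp & 0x3F)) };
}

Console.WriteLine(BitConverter.ToString(EncodeUtf8(0x20AC)));   // E2-82-AC

// Cross-check against the framework's own encoder.
byte[] expected = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(0x20AC));
Console.WriteLine(BitConverter.ToString(expected));             // E2-82-AC
```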

Determining the size of character

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character.

This means that the leading bits for a 2 byte character (110) are different than the leading bits of a 3 byte character (1110). These leading bits can be used to uniquely identify the number of bytes a character takes.
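
As a sketch of how a reader could use those leading bits, the snippet below walks a byte array one character at a time. `SequenceLength` is a hypothetical helper made up for this example. It handles sequences of 1 to 4 bytes only and assumes the input is well-formed UTF-8, so it does no validation of the continuation bytes.

```csharp
using System;
using System.Text;

// Hypothetical helper: number of bytes in the sequence that starts with this lead byte.
static int SequenceLength(byte lead)
{
    if ((lead & 0x80) == 0x00) return 1;   // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // continuation byte (10xxxxxx) or unsupported lead
}

// "a", "©" and "€" back to back: 1 + 2 + 3 = 6 bytes.
byte[] bytes = Encoding.UTF8.GetBytes("a\u00A9\u20AC");

int offset = 0;
while (offset < bytes.Length)
{
    int length = SequenceLength(bytes[offset]);
    if (length == 0) break;                // not expected for well-formed input
    Console.WriteLine($"offset {offset}: {length} byte(s)");
    offset += length;
}
```

This is also why there is no need to know the size of the next character in advance: the lead byte says how far to skip.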

More information

Yeah, so utf-8 uses 2 bytes when the charcode is greater than 127 etc. Which leaves me very curious about how can you know when the next character is going to use 1, 2, 3 or 4 bytes. Thanks! — Delta, Jun 11 '11 at 05:14
I have updated the answer to clarify the calculation of character byte size. BTW, one need not know what the size of the next character will be. The size will be calculated once the pointer reaches the bit sequence of the next character. — Devendra D. Chavan, Jun 11 '11 at 05:26
**This answer is wrong!** UTF-16 is variable width, using 16-bit code units, just as UTF-8 is variable width, using 8-bit code units. This statement is a lie: `Similarly, UTF16 will always consume 16 bits per character (it is fixed length as compared to variable length in UTF8)` — tchrist, Jun 11 '11 at 15:13
I stand corrected. As pointed out by @tchrist, UTF16 can take 2 to 4 bytes per character. I have updated the answer to reflect the same. — Devendra D. Chavan, Jun 12 '11 at 01:40

score 1 · Answer 3 · answered Jun 11 '11 at 04:26

Take a look at the following article:

http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

Abdallah
