Out of curiosity, I wonder why, for example, the character "ł" with code point 322 has the UTF-8 binary representation 11000101:10000010 (decimal 197:130) and not its actual binary representation 00000001:01000010 (decimal 1:66)?
-
See, amongst others, [How does a file with Chinese characters know how many bytes to use per character?](https://stackoverflow.com/questions/775412/) and [Really good bad UTF-8 example test data](https://stackoverflow.com/questions/1319022/) and [If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?](https://stackoverflow.com/questions/6338944/) – Jonathan Leffler Jun 19 '17 at 04:30
2 Answers
UTF-8 encodes Unicode code points in the range U+0000..U+007F in a single byte. Code points in the range U+0080..U+07FF use 2 bytes, code points in the range U+0800..U+FFFF use 3 bytes, and code points in the range U+10000..U+10FFFF use 4 bytes.
When the code point needs two bytes, then the first byte starts with the bit pattern 110; the remaining 5 bits are the high order 5 bits of the Unicode code point. The continuation byte starts with the bit pattern 10; the remaining 6 bits are the low order 6 bits of the Unicode code point.
You are looking at ł U+0142 LATIN SMALL LETTER L WITH STROKE (decimal 322). The bit pattern representing hexadecimal 142 is:
00000001 01000010
With the UTF-8 sub-field grouping marked by colons, that is:
00000:001 01:000010
So the UTF-8 code is:
110:00101 10:000010
11000101 10000010
0xC5 0x82
197 130
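For concreteness, a small Python sketch (Python 3 assumed; the variable names are just for illustration) that reproduces that bit manipulation:

```python
cp = 0x0142                  # code point of 'ł' (decimal 322)

# split the low 11 bits into a 5-bit high part and a 6-bit low part
high5 = (cp >> 6) & 0b11111  # 0b00101
low6  = cp & 0b111111        # 0b000010

byte1 = 0b11000000 | high5   # 110xxxxx -> 0xC5 (197)
byte2 = 0b10000000 | low6    # 10xxxxxx -> 0x82 (130)

print(hex(byte1), hex(byte2))                         # 0xc5 0x82
print(bytes([byte1, byte2]) == 'ł'.encode('utf-8'))   # True
```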
The same basic ideas apply to 3-byte and 4-byte encodings: you chop off 6 bits per continuation byte and combine the leading bits with the appropriate marker bits (1110 for 3 bytes; 11110 for 4 bytes; there are as many leading 1 bits as there are bytes in the complete character). There are a bunch of other rules that don't matter much to you right now. For example, you never encode a UTF-16 high surrogate (U+D800..U+DBFF) or a low surrogate (U+DC00..U+DFFF) in UTF-8 (or UTF-32, come to that). You never encode a non-minimal sequence (so although the bytes 0xC0 0x80 could be used to encode U+0000, this is invalid). One consequence of these rules is that the bytes 0xC0 and 0xC1 are never valid in UTF-8 (and neither are 0xF5..0xFF).
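A hand-rolled encoder following those rules might look like the sketch below (illustration only; in real code you would use your language's built-in codecs — the function name is made up). Picking the shortest form for each range is what rules out non-minimal sequences:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x0142))                                  # b'\xc5\x82'
print(encode_utf8(0x0142) == chr(0x0142).encode('utf-8'))   # True
```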

UTF8 is designed for compatibility with 7-bit ASCII.
To achieve this the most significant bit of bytes in a UTF8 encoded byte sequence is used to signal whether a byte is part of a multi-byte encoded code point. If the MSB is set, then the byte is part of a sequence of 2 or more bytes that encode a single code point. If the MSB is not set then the byte encodes a code point in the range 0..127.
Therefore, in UTF8 the byte sequence [1][66] represents the two code points 1 and 66 respectively, since the MSB is not set (=0) in either byte.
Furthermore, the code point #322 must be encoded using a sequence of bytes where the MSB is set (=1) in each byte.
The precise details of UTF8 encoding are quite a bit more complex but there are numerous resources that go into those details.
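As a rough illustration (a Python sketch; the classify helper is made up), you can tell the kinds of byte apart by their top bits:

```python
def classify(b: int) -> str:
    if b & 0b10000000 == 0:            # MSB not set
        return "single-byte code point (0..127)"
    if b & 0b11000000 == 0b10000000:   # 10xxxxxx
        return "continuation byte"
    return "lead byte of a multi-byte sequence"

for seq in ([1, 66], [197, 130]):
    print(seq, [classify(b) for b in seq])

# [1, 66]    -> two single-byte code points (U+0001 and U+0042)
# [197, 130] -> one lead byte plus one continuation byte (U+0142, 'ł')
```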

-
So how does one "translate" `11000101:10000010` to `322` to represent "ł" – laiboonh Jun 19 '17 at 04:10
-
Methinks ASCII is a 7-bit code, not a 127-bit code. It has 128 characters, values 0..127 — which is probably where your '127-bit ASCII' comes from. I've always heard of the U+0000 .. U+007F codes as being single bytes that need no continuation bytes to complete them. – Jonathan Leffler Jun 19 '17 at 04:23
-
https://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16 – laiboonh Jun 19 '17 at 04:25
-
@JonathanLeffler - lol. Methinks you could be right. Simple typo - easily corrected. :) – Deltics Jun 19 '17 at 04:40
-
Thanks; the easy fix was indeed the easy fix. I fear that you need to review the encoding for UTF-8 more closely. Your description isn't entirely accurate. – Jonathan Leffler Jun 19 '17 at 04:45
-
@user1677501 - 11000101:10000010 already "represents" `ł`. I presume you mean how do you transcode the UTF8 encoding to some other UTF encoding appropriate to your needs? The answer to that very much depends on your development environment and needs. Most dev toolkits, platforms or frameworks provide facilities for transcoding between UTF schemes. – Deltics Jun 19 '17 at 04:45
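For reference, a minimal Python sketch (relying on the standard built-in UTF-8 codec) of the round trip asked about above, decoding the two bytes back to a character and taking its code point:

```python
b = bytes([0b11000101, 0b10000010])   # 0xC5 0x82
ch = b.decode('utf-8')                # 'ł'
print(ch, ord(ch))                    # ł 322
```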
-
@JonathanLeffler - yep, in simplifying I screwed up entirely. Hopefully fixed now, without getting too complicated for OP. – Deltics Jun 19 '17 at 04:53