Out of curiosity, I wonder why, for example, the character "ł" with code point 322 has the UTF-8 binary representation 11000101:10000010 (decimal 197:130) and not its actual binary representation 00000001:01000010 (decimal 1:66)?
-
See, amongst others, [How does a file with Chinese characters know how many bytes to use per character?](https://stackoverflow.com/questions/775412/) and [Really good bad UTF-8 example test data](https://stackoverflow.com/questions/1319022/) and [If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?](https://stackoverflow.com/questions/6338944/) – Jonathan Leffler Jun 19 '17 at 04:30
2 Answers
UTF-8 encodes Unicode code points in the range U+0000..U+007F in a single byte. Code points in the range U+0080..U+07FF use 2 bytes, code points in the range U+0800..U+FFFF use 3 bytes, and code points in the range U+10000..U+10FFFF use 4 bytes.
When the code point needs two bytes, then the first byte starts with the bit pattern 110; the remaining 5 bits are the high order 5 bits of the Unicode code point. The continuation byte starts with the bit pattern 10; the remaining 6 bits are the low order 6 bits of the Unicode code point.
You are looking at ł U+0142 LATIN SMALL LETTER L WITH STROKE (decimal 322). The bit pattern representing hexadecimal 142 is:
00000001 01000010
With the UTF-8 sub-field grouping marked by colons, that is:
00000:001 01:000010
So the UTF-8 code is:
110:00101 10:000010
11000101 10000010
0xC5 0x82
197 130
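For concreteness, a small Python sketch (Python 3 assumed; the variable names are just for illustration) that reproduces that bit manipulation:

```python
cp = 0x0142                  # code point of 'ł' (decimal 322)

# split the low 11 bits into a 5-bit high part and a 6-bit low part
high5 = (cp >> 6) & 0b11111  # 0b00101
low6  = cp & 0b111111        # 0b000010

byte1 = 0b11000000 | high5   # 110xxxxx -> 0xC5 (197)
byte2 = 0b10000000 | low6    # 10xxxxxx -> 0x82 (130)

print(hex(byte1), hex(byte2))                         # 0xc5 0x82
print(bytes([byte1, byte2]) == 'ł'.encode('utf-8'))   # True
```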
The same basic ideas apply to 3-byte and 4-byte encodings: you chop off 6 bits per continuation byte and combine the leading bits with the appropriate marker bits (1110 for 3 bytes; 11110 for 4 bytes; there are as many leading 1 bits as there are bytes in the complete character). There are a bunch of other rules that don't matter much to you right now. For example, you never encode a UTF-16 high surrogate (U+D800..U+DBFF) or a low surrogate (U+DC00..U+DFFF) in UTF-8 (or UTF-32, come to that). You never encode a non-minimal sequence (so although the bytes 0xC0 0x80 could be used to encode U+0000, this is invalid). One consequence of these rules is that the bytes 0xC0 and 0xC1 are never valid in UTF-8 (and neither are 0xF5..0xFF).
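A hand-rolled encoder following those rules might look like the sketch below (illustration only; in real code you would use your language's built-in codecs — the function name is made up). Picking the shortest form for each range is what rules out non-minimal sequences:

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x0142))                                  # b'\xc5\x82'
print(encode_utf8(0x0142) == chr(0x0142).encode('utf-8'))   # True
```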

UTF8 is designed for compatibility with 7-bit ASCII.
To achieve this the most significant bit of bytes in a UTF8 encoded byte sequence is used to signal whether a byte is part of a multi-byte encoded code point. If the MSB is set, then the byte is part of a sequence of 2 or more bytes that encode a single code point. If the MSB is not set then the byte encodes a code point in the range 0..127.
Therefore, in UTF8 the byte sequence [1][66] represents the two code points 1 and 66 respectively, since the MSB is not set (=0) in either byte.
Furthermore, the code point #322 must be encoded using a sequence of bytes where the MSB is set (=1) in each byte.
The precise details of UTF8 encoding are quite a bit more complex but there are numerous resources that go into those details.
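As a rough illustration (a Python sketch; the classify helper is made up), you can tell the kinds of byte apart by their top bits:

```python
def classify(b: int) -> str:
    if b & 0b10000000 == 0:            # MSB not set
        return "single-byte code point (0..127)"
    if b & 0b11000000 == 0b10000000:   # 10xxxxxx
        return "continuation byte"
    return "lead byte of a multi-byte sequence"

for seq in ([1, 66], [197, 130]):
    print(seq, [classify(b) for b in seq])

# [1, 66]    -> two single-byte code points (U+0001 and U+0042)
# [197, 130] -> one lead byte plus one continuation byte (U+0142, 'ł')
```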

-
So how does one "translate" `11000101:10000010` to `322` to represent "ł" – laiboonh Jun 19 '17 at 04:10
-
Methinks ASCII is a 7-bit code, not a 127-bit code. It has 128 characters, values 0..127 — which is probably where your '127-bit ASCII' comes from. I've always heard of the U+0000 .. U+007F codes as being single bytes that need no continuation bytes to complete them. – Jonathan Leffler Jun 19 '17 at 04:23
-
https://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16 – laiboonh Jun 19 '17 at 04:25
-
@JonathanLeffler - lol. Methinks you could be right. Simple typo - easily corrected. :) – Deltics Jun 19 '17 at 04:40
-
Thanks; the easy fix was indeed the easy fix. I fear that you need to review the encoding for UTF-8 more closely. Your description isn't entirely accurate. – Jonathan Leffler Jun 19 '17 at 04:45
-
@user1677501 - 11000101:10000010 already "represents" `ł`. I presume you mean how do you transcode the UTF8 encoding to some other UTF encoding appropriate to your needs? The answer to that very much depends on your development environment and needs. Most dev toolkits, platforms or frameworks provide facilities for transcoding between UTF schemes. – Deltics Jun 19 '17 at 04:45
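For reference, a minimal Python sketch (relying on the standard built-in UTF-8 codec) of the round trip asked about above, decoding the two bytes back to a character and taking its code point:

```python
b = bytes([0b11000101, 0b10000010])   # 0xC5 0x82
ch = b.decode('utf-8')                # 'ł'
print(ch, ord(ch))                    # ł 322
```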
-
@JonathanLeffler - yep, in simplifying I screwed up entirely. Hopefully fixed now, without getting too complicated for OP. – Deltics Jun 19 '17 at 04:53