UTF-8 Encoding size

Question

Which Unicode characters fit in 1, 2, or 4 bytes? Can someone point me to a complete character chart?

Read this first: http://www.joelonsoftware.com/articles/Unicode.html — Cody Gray - on strike, Feb 03 '11 at 10:05
A complete chart? That's going to be a HUGE one. See this for a printed version of the Basic Multilingual Plane (there are 16 more): http://shop.designinmainz.de/en/Poster/decodeunicode-Basic-Multilingual-Plane-BMP-Map See DecodeUnicode for a wiki-like representation of Unicode characters: http://www.decodeunicode.org/en — Piskvor left the building, Feb 03 '11 at 10:09
You could also read about Universal Codes: http://en.wikipedia.org/wiki/Universal_code_%28data_compression%29 — ruslik, Feb 03 '11 at 10:25
Possible duplicate of [How many characters can UTF-8 encode?](https://stackoverflow.com/questions/10229156/how-many-characters-can-utf-8-encode) — tripleee, Dec 12 '17 at 10:08

score 27 · Accepted Answer · edited Nov 21 '11 at 23:36

Characters are encoded according to their position in the code point range. The algorithm is described on the Wikipedia page for UTF-8, and it can be implemented very quickly.

U+0000 to U+007F are (correctly) encoded with one byte
U+0080 to U+07FF are encoded with 2 bytes
U+0800 to U+FFFF are encoded with 3 bytes
U+010000 to U+10FFFF are encoded with 4 bytes
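The ranges above can be turned into a small lookup. This is only a minimal sketch in Python: the function name `utf8_length` is made up for illustration, and it assumes you already have the code point as an integer.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point (sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code point range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        # Cross-check against Python's built-in encoder.
        print(f"U+{ord(ch):04X}", utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

In everyday code `len(ch.encode("utf-8"))` gives the same number; the explicit version is only there to make the boundaries visible.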

edited Nov 21 '11 at 23:36

Brian Deragon

answered Feb 03 '11 at 10:02

Jimmy

score 6 · Answer 2 · edited Nov 05 '18 at 19:31

The Wikipedia article on UTF-8 has a good enough description of the encoding:

1 byte = code points 0x000000 to 0x00007F (inclusive)
2 bytes = code points 0x000080 to 0x0007FF
3 bytes = code points 0x000800 to 0x00FFFF
4 bytes = code points 0x010000 to 0x10FFFF

The charts can be downloaded directly from unicode.org. It's a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).

Also be aware that Unicode (compared to something like ASCII) is much more complex to process. There are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (and a process to convert strings into a canonical form suitable for comparison), a lot more white-space characters, etc. I'd recommend downloading the entire Unicode specification and reading most of it if you're planning to do more than "not much".
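To illustrate the "different ways of representing the exact same string" point, here is a small Python sketch using the standard `unicodedata` module. The helper name `same_text` is hypothetical, and NFC is just one of the available normalization forms; which form suits you depends on your use case.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (sketch)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

if __name__ == "__main__":
    print(composed == decomposed)            # raw comparison of code points
    print(same_text(composed, decomposed))   # comparison after normalization
    # The two spellings also differ in UTF-8 size.
    print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Both strings render as the same character, but they only compare equal after normalization, and their encoded sizes differ (2 bytes versus 3).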

Michael · Answer 3 · 2011-02-03T10:41:58.250

1

A UTF-8 sequence consists of 1 to a maximum of 6 bytes, although all current code points are covered by just 4 bytes. UTF-8 uses the first byte to determine how long (in bytes) the character is; see the Wikipedia page:

UTF-8 Wikipedia
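As a rough illustration of how the first byte signals the length, here is a Python sketch. The name `sequence_length` is invented for this example, and it only inspects the leading bit pattern: it handles the 1 to 4 byte forms used today, treats anything else as an error, and does not validate the continuation bytes or reject lead bytes such as 0xC0 that a strict decoder would refuse.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the length of the UTF-8 sequence starting with lead_byte (sketch)."""
    if not 0 <= lead_byte <= 0xFF:
        raise ValueError("not a byte value")
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; 111110xx and 1111110x were the
    # 5- and 6-byte lead bytes of the original design, no longer valid.
    raise ValueError("not a valid lead byte in modern UTF-8")


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        data = ch.encode("utf-8")
        print(ch, hex(data[0]), sequence_length(data[0]), len(data))
```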

Single byte UTF-8 is effectively ASCII - UTF-8 was designed to be compatible with it, which is why it's more prevalent than UTF-16, for example.

Edit: Apparently, it was agreed that UTF-8's code points would not exceed 21 bits (4-byte sequences), but the encoding has the technical capability to handle up to 31 bits (6-byte UTF-8).

edited Feb 03 '11 at 10:41

answered Feb 03 '11 at 10:22

Michael

UTF-8 is limited to 4 bytes. Unicode code points are limited to U+10FFFF (which fits in 21 bits), and UTF-8 encoding is canonical (must choose shortest). Therefore, you can never end up with a 5 byte UTF-8 sequence. Either it would decode to a value past U+10FFFF, or it would not be canonical. – MSalters Feb 03 '11 at 10:29
UTF-8's current character set only uses 4 bytes, but it was designed for code points up to 31 bits - resulting in a 6 byte sequence. – Michael Feb 03 '11 at 10:35
*6-byte characters*? [shudder] – Piskvor left the building Feb 03 '11 at 11:25
You're correct, though Wiki has so much dumby history that I'm angry to scroll and read through it xd – Mar 18 '17 at 10:49
I was looking for an answer mentioning the 6 bytes maximum like this, but I was also wanting to know the max code point. I thought it'd be U+FFFFFF, like ECMAScript 4. – Mar 18 '17 at 10:50
