According to this answer, UTF-8 used to allow up to 6 bytes per encoded code point but has since been limited to 4. Some comments and answers in that post suggest that it's possible, maybe just in theory, to produce UTF-8 characters beyond 6 bytes. All that said, can the iOS keyboard produce a 6-byte UTF-8 character? What is the biggest UTF-8 character that the iOS keyboard can produce, and how would one determine that? And do these limits apply both to what the keyboard can output and to what the user can copy and paste into a text field?

-
Yes, UTF-8 can *theoretically* encode a single Unicode codepoint with more than 4 bytes, but *in practice* that will never happen since there are no Unicode codepoints defined that are higher than U+10FFFF. This is why the UTF-8 specs *artificially* limit its encoding to 4 bytes max (to remain compatible with UTF-16, which can't *physically* encode a single codepoint that is higher than U+10FFFF). – Remy Lebeau Mar 08 '23 at 01:47
1 Answer
Many Emoji characters are 8 bytes in UTF-8. For example, all the country flag Emojis are 8 bytes in UTF-8 encoding because each flag Emoji is actually composed of 2 Unicode characters.
Some of the "people" Emojis are over 8 bytes in UTF-8 encoding. For example (picked at random), the "man vampire", character "U+1F9DB U+200D U+2642 U+FE0F" (that's 4 Unicode characters), is "F0 9F A7 9B E2 80 8D E2 99 82 EF B8 8F" in UTF-8 encoding.
So yes, a single Emoji character can produce a UTF-8 encoding that is over 4 bytes.
Technically a single Unicode code point is at most 4 bytes in UTF-8 (the original UTF-8 design allowed up to 6), but many characters that can be entered on an iOS keyboard, such as many Emoji symbols, are actually made of multiple Unicode code points, which allows the UTF-8 encoding of these symbols to be well over 4 bytes.
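As a sketch of how one might verify these byte counts in a Swift Playground (the specific emoji chosen here are illustrative):

```swift
// Inspect the layers of an emoji: grapheme clusters, Unicode scalars,
// and UTF-8 bytes. The values follow from the codepoints discussed above.
let flag = "🇺🇸"        // 2 regional-indicator scalars
let manVampire = "🧛‍♂️"  // U+1F9DB U+200D U+2642 U+FE0F

print(flag.count, flag.unicodeScalars.count, flag.utf8.count)
// 1 2 8
print(manVampire.count, manVampire.unicodeScalars.count, manVampire.utf8.count)
// 1 4 13
```

Note that `.count` on a Swift `String` counts user-perceived characters, which is why both emoji report a length of 1 despite their very different byte sizes.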

-
So when I execute `for x in "🇺🇸".utf8 {print(x)}` in an Xcode Playground, for example, `x` represents a single UTF-8 encoded byte, correct? – lurning too koad Mar 02 '23 at 21:36
-
That is correct. In your example you see the 8 bytes of the flag in UTF-8 encoding. – HangarRash Mar 02 '23 at 21:41
-
Just so someone verifies my math: I'm working with a database with a document limit of 1,048,576 bytes (1MB), and string sizes are calculated as the number of UTF-8 encoded bytes. We just established that the largest possible UTF-8 encoded character is 8 bytes. Therefore, if this document contained nothing but a single string, that string should not be allowed to be more than 131,072 characters, correct? – lurning too koad Mar 03 '23 at 02:10
-
@kidcoder It's not that simple. Simple text like this would be one byte per character in UTF-8 encoding. So if the user only entered basic ASCII characters, they could enter over 1MB of characters. If they entered nothing but a bunch of "man vampire" emojis then you can only support about 80,650 characters. So you should let them type anything but limit the string so the UTF-8 encoding is under 1MB. – HangarRash Mar 03 '23 at 02:16
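A minimal sketch of that suggestion in Swift, limiting input by UTF-8 byte count rather than character count (the function name and truncation strategy are assumptions for illustration):

```swift
// Truncate a string so its UTF-8 encoding fits within maxBytes,
// cutting only on Character (grapheme cluster) boundaries so that
// no multi-codepoint emoji is split mid-sequence.
func truncated(_ text: String, toUTF8Bytes maxBytes: Int) -> String {
    var result = ""
    var byteCount = 0
    for ch in text {
        let chBytes = String(ch).utf8.count
        if byteCount + chBytes > maxBytes { break }
        result.append(ch)
        byteCount += chBytes
    }
    return result
}

let limited = truncated("héllo", toUTF8Bytes: 4)
print(limited, limited.utf8.count)  // hél 4
```

Iterating over `Character` values (rather than UTF-8 bytes) is the key design choice here: a cut mid-emoji would otherwise produce invalid or visually different text.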
-
About the man vampire, why does it print 13 bytes with `for x in "🧛‍♂️".utf8 {print(x)}`? Is this emoji 13 UTF-8-encoded bytes? – lurning too koad Mar 03 '23 at 02:39
-
@kidcoder Yes, that character's (and many others) UTF-8 encoding is 13 bytes. – HangarRash Mar 03 '23 at 03:00
-
I'm confused again. Why does this article, and many others, say that 4 bytes is the maximum for any UTF-8 char? https://stijndewitt.com/2014/08/09/max-bytes-in-a-utf-8-char/ – lurning too koad Mar 03 '23 at 03:05
-
@kidcoder That article is several years old and it doesn't address characters like the "man vampire", which is actually made up of 4 Unicode characters (which, combined and encoded as UTF-8, give 13 bytes), or the flags, which are made up of 2 Unicode characters (which, combined and encoded as UTF-8, give 8 bytes). So yes, technically, a single Unicode character is at most 4 bytes in UTF-8 (6 under the original design), but many Emoji and other symbols are made of more than one Unicode character. – HangarRash Mar 03 '23 at 03:29
-
I appreciate all of your help and knowledge here. I think what I should do is limit text inputs to byte count and not character count. Woman vampire is 17 bytes! – lurning too koad Mar 03 '23 at 03:32
-
@kidcoder I updated the answer to better reflect that additional information. – HangarRash Mar 03 '23 at 03:33
-
@kidcoder And start adding skin tone to many of the people Emoji symbols and the byte count goes up even more. – HangarRash Mar 03 '23 at 03:34
-
@kidcoder "*Why does this article, and many others, say that 4 bytes is the maximum for any UTF-8 char?*" - because that is the max byte count for a single Unicode **codepoint**. Unicode doesn't deal in **characters**. What you think of as a "character" is more formally known as a **grapheme cluster**, which can consist of 1 codepoint acting alone, or 2+ codepoints acting together as one unit. Most Emojis consist of more than 1 codepoint. – Remy Lebeau Mar 08 '23 at 01:51
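The codepoint-vs-grapheme-cluster distinction described in that comment can be checked directly in Swift, where each individual scalar stays within the 4-byte limit even though the cluster as a whole does not:

```swift
// Each individual codepoint of the "man vampire" encodes to at most
// 4 UTF-8 bytes, but the grapheme cluster as a whole is 13 bytes.
let manVampire = "🧛‍♂️"  // U+1F9DB U+200D U+2642 U+FE0F
for scalar in manVampire.unicodeScalars {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(hex)", String(Character(scalar)).utf8.count)
}
// U+1F9DB 4
// U+200D 3
// U+2642 3
// U+FE0F 3
print(manVampire.utf8.count)  // 13
```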
-
@kidcoder "*The largest-possible UTF-8 encoded character is 8 bytes*" - nope, not even close. Emojis tend to combine only a handful of codepoints at a time, so a single Emoji "character" *may or may not* fit within 8 bytes in UTF-8 (as the answer and comments above demonstrate, there are Emoji "characters" that are well over 8 bytes in UTF-8). And then there are things like [Zalgo text](https://stackoverflow.com/questions/6579844/), for example, which is basically *unlimited* in how many codepoints it combines, producing some pretty horrific text, but even worse when figuring out text limits. – Remy Lebeau Mar 08 '23 at 02:04
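The unbounded case mentioned in that comment can be demonstrated with combining marks (the specific mark and the repetition count here are arbitrary illustrations):

```swift
// "Zalgo" effect: a base letter plus arbitrarily many combining marks
// still forms a single grapheme cluster, so there is no fixed upper
// bound on UTF-8 bytes per user-perceived character.
var zalgo = "e"
for _ in 0..<100 {
    zalgo.append("\u{0301}")  // COMBINING ACUTE ACCENT: 2 UTF-8 bytes
}
print(zalgo.count)      // 1   (still one Character)
print(zalgo.utf8.count) // 201 (1 + 100 × 2 bytes)
```

This is why limiting by UTF-8 byte count, rather than by character count, is the safe approach for a byte-budgeted store.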