
I am confused about the byte representation of an emoji encoded in UTF8. My understanding is that UTF8 characters are variable in size, up to 4 bytes.

When I encode the ❤️ emoji in UTF8 on iOS 13, I get back 6 bytes:

NSString* heartEmoji = @"❤️";
NSData* utf8 = [heartEmoji dataUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", utf8); // {length = 6, bytes = 0xe29da4efb88f}

If I revert the operation, just consuming the first 3 bytes, I get a unicode heart back:

uint8_t bytes[3] = { 0 };
[utf8 getBytes:bytes length:3];
NSString* decoded = [[NSString alloc] initWithBytes:bytes length:3 encoding:NSUTF8StringEncoding];
NSLog(@"%@", decoded); // ❤

Note that I use the heart as an example; I tried with many emoji and most are 4 bytes in UTF8, but some are 6.

Do I have some faulty assumptions about UTF8? What can I do to represent all emoji in 4 bytes as UTF8?

TheNextman

1 Answer


My understanding is that UTF8 characters are variable in size, up to 4 bytes.

This is not quite correct. UTF-8 encodes a single code point in at most 4 bytes. But a character (specifically an extended grapheme cluster) can be much longer than that, because it may be built from several combining code points: dozens of bytes for common emoji sequences, and unbounded in the most extreme cases. See Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings? for an interesting example.
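For example (a sketch in the style of the code in the question), a single family emoji renders as one character but occupies 25 bytes in UTF-8:

NSString* family = @"👩‍👩‍👧‍👦"; // WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
NSData* familyUTF8 = [family dataUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%lu", (unsigned long)familyUTF8.length); // 25 -- seven code points, one grapheme cluster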

In your example, your emoji is HEAVY BLACK HEART (U+2764) followed by VARIATION SELECTOR-16 (U+FE0F), which requests the colorful emoji presentation rather than a plain text glyph. UTF-8 requires three bytes to encode each of those code points, which is why you get six bytes in total.
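You can verify this from Foundation itself. Here is a small sketch (my illustration, using the UTF-32 representation to isolate the individual code points):

NSString* heartEmoji = @"❤️";
// The explicit little-endian UTF-32 encoding yields one uint32_t per code point, with no BOM
NSData* utf32 = [heartEmoji dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
const uint32_t* codePoints = utf32.bytes;
for (NSUInteger i = 0; i < utf32.length / sizeof(uint32_t); i++) {
    uint32_t codePoint = codePoints[i];
    NSString* single = [[NSString alloc] initWithBytes:&codePoint
                                                length:sizeof(uint32_t)
                                              encoding:NSUTF32LittleEndianStringEncoding];
    NSLog(@"U+%04X -> %lu bytes in UTF-8",
          (unsigned int)codePoint,
          (unsigned long)[single lengthOfBytesUsingEncoding:NSUTF8StringEncoding]);
}
// U+2764 -> 3 bytes in UTF-8
// U+FE0F -> 3 bytes in UTF-8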

Rob Napier
  • Thanks Rob, you answered my question. Let's say I want to send my ❤️ over the wire, encoded as UTF-8, in 4-byte arrays. On the other end, I'm converting back into a key sequence. If I split my 6 bytes into two separate 3-byte packets, it works well. Is it safe to assume that if the length is greater than 4, I should process it in 3-byte code points? – TheNextman Mar 19 '20 at 23:06
  • 2
    Definitely not. UTF-8 can be combined in numerous ways with code points of arbitrary length (between 1 and 4). The fact that both of these combining characters happens to be 3 bytes long in UTF-8 is a coincidence. ‍‍‍ is 22 bytes long in UTF-8, made up of 7 code points, 4 of which are 4 bytes long, and 3 of which are 2 bytes long. UTF-8 should be sent as a stream of bytes (that's its whole purpose). If you're going to force each code point into a 4-byte chunk, you might as well use UTF-32, which is exactly that. – Rob Napier Mar 19 '20 at 23:49
  • Thanks for the clarification – TheNextman Mar 20 '20 at 01:07
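For reference, a minimal sketch of the UTF-32 route mentioned above, in which every code point occupies exactly 4 bytes. Note that plain NSUTF32StringEncoding prepends a 4-byte BOM, so the explicit little-endian variant is used instead:

NSString* heartEmoji = @"❤️";
// Every code point becomes exactly one 4-byte unit with the explicit-endian encoding
NSData* utf32 = [heartEmoji dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
NSLog(@"%@", utf32); // {length = 8, bytes = 0x642700000ffe0000}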