
What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length in C# / .NET)?

I see 3 possibilities:

  1. # of UTF-16 code units x 2

  2. # of UTF-16 code units x 3

  3. # of UTF-16 code units x 4

A UTF-16 code point is represented by either 1 or 2 code units, so we just need to consider the worst case scenario of a string filled with one or the other. If a UTF-16 string is composed entirely of 2 code unit code points, then we know the UTF-8 representation will be at most the same size, since the code points take up a maximum of 4 bytes in both representations, thus worst case is option (1) above.

So the interesting case to consider, which I don't know the answer to, is the maximum number of bytes that a single code unit UTF-16 code point can require in UTF-8 representation.

If all single code unit UTF-16 code points can be represented with 3 UTF-8 bytes, which my gut tells me makes the most sense, then option (2) will be the worst case scenario. If there are any that require 4 bytes then option (3) will be the answer.

Does anyone have insight into which is correct? I'm really hoping for (1) or (2) as (3) is going to make things a lot harder :/

UPDATE

From what I can gather, UTF-16 encodes all characters in the BMP in a single code unit, and all other planes are encoded in 2 code units.

It seems that UTF-8 can encode the entire BMP within 3 bytes and uses 4 bytes for encoding the other planes.

Thus it seems to me that option (2) above is the correct answer, and this should work:

string str = "Some string";
int maxUtf8EncodedSize = str.Length * 3;

Does that seem like it checks out?
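
As a quick sanity check (just a sketch, not a proof), the framework's encoder can report the sizes directly: Encoding.UTF8.GetByteCount gives the exact encoded size of a concrete string, and Encoding.UTF8.GetMaxByteCount gives .NET's own worst-case bound for a given char count:

using System;
using System.Text;

// 10 BMP chars that each need 3 UTF-8 bytes (the worst single-code-unit case).
string bmpWorstCase = new string('\uFFFF', 10);
// A surrogate pair counts as 2 chars in Length but encodes to only 4 UTF-8 bytes.
string withSurrogate = "a\U0001F600b";

Console.WriteLine(Encoding.UTF8.GetByteCount(bmpWorstCase));  // 30 (= 10 * 3)
Console.WriteLine(Encoding.UTF8.GetByteCount(withSurrogate)); // 6  (<= 4 * 3)
Console.WriteLine(Encoding.UTF8.GetMaxByteCount(10));         // the framework's own bound, slightly larger than 10 * 3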

Mike Marynowski
  • I think you can use the tables shown in Wikipedia for this. Both [UTF-16](https://en.wikipedia.org/wiki/UTF-16) and [UTF-8](https://en.wikipedia.org/wiki/UTF-8) use 4 bytes for code points in the supplemental planes. You can derive the bytes used in the BMP using the standard ranges. 2 bytes for UTF-16 and up to 3 bytes for UTF-8 (in the range U+0800 - U+FFFF). – Jimi Mar 08 '19 at 04:10
  • Since it's not clear why you need to pre-calculate a (hypothetical) number of bytes, maybe give a look at [Encoding.GetMaxByteCount](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.getmaxbytecount) and friends. It may be interesting. – Jimi Mar 08 '19 at 04:29
  • @Jimi String length counting everywhere in .NET is based around UTF-16 code unit counting, including UI controls (i.e. `TextBox.MaxLength`) and I need to set `MaxLength` based on max allowed UTF-8 encoded size and show a live `char count / max chars` label underneath. I want to avoid the *huge* mess and complication of substituting all the built in length calculations with UTF8 encoded length calculations for this purpose, which will be fine as long as I can guarantee `string.Length * 3` is the max size. If it's `string.Length * 4` then I'm boned because it will be too restrictive. – Mike Marynowski Mar 08 '19 at 05:22
  • @MikeMarynowski String length in .Net just counts the number of `Char` objects in the string. The fact these are internally treated as UTF-16 has no influence on that; unicode characters with an internal value exceeding 2-byte storage in UTF-16 are still treated as a single Char by `String.Length`. The whole system is deliberately designed so you never have to take the internal encoding into account. – Nyerguds Mar 08 '19 at 08:42
  • @Nyerguds That is incorrect. It counts the number of chars and chars are UTF-16 code units NOT code points. Characters that take 2 UTF-16 code units will be counted as a string of length 2 even though it only displays as a single character. – Mike Marynowski Mar 08 '19 at 15:13
  • @Nyerguds There are lots of situations that require you to look at the internal encoding considering that affects how .NET counts characters, this being one of them. I run into this all the time when working on international applications that end up dealing with chinese characters and such. – Mike Marynowski Mar 08 '19 at 15:15
  • There is no way to represent characters past the BMP plane with a single `char`. See https://learn.microsoft.com/en-us/dotnet/api/system.char?view=netframework-4.7.2 – Mike Marynowski Mar 08 '19 at 15:21
  • Clearing up a few points not relevant to the answer... "UTF-16 code points" : no such thing. Codepoints are members of the character set, not a character encoding's code unit values. "Single code unit UTF-16 code point can require in UTF-8 representation": in general cannot convert from an encoding's code unit to anything else because it could be only part of the representation of a codepoint. – Tom Blodget Mar 09 '19 at 01:55
  • @MikeMarynowski What you say is only true for cases like diacritics split off as separate characters. But for code points with a value higher than what can be stored in the standard two bytes in UTF-16, it will _still_ only be one `Char` object, and will thus still only count as 1 in the length. – Nyerguds Mar 10 '19 at 02:32
  • @Nyerguds There’s no way that’s possible considering chars are 16 bit value types. The docs I linked make this pretty clear. Show me a char that you can put in a string past the BMP that you can represent in a single char. – Mike Marynowski Mar 10 '19 at 02:53
  • Huh. I see. I honestly never realized Char struct only went up to 16 bit. – Nyerguds Mar 10 '19 at 20:27
  • @TomBlodget I didn't ask about converting from a code unit to anything else - the line you are quoting is asking about converting code points from UTF-16 to UTF-8, specifically those that can be represented by a single UTF-16 code unit. – Mike Marynowski Mar 11 '19 at 18:10
  • https://stackoverflow.com/questions/5728045/c-most-efficient-way-to-determine-how-many-bytes-will-be-needed-for-a-utf-16-st – hippietrail Nov 08 '19 at 14:22
  • @hippietrail That's for the reverse direction which I don't need but thanks for the additional info! – Mike Marynowski Nov 08 '19 at 18:22

2 Answers


The worst case for a single UTF-16 word is U+FFFF, which in UTF-16 is encoded just as-is (0xFFFF). In UTF-8 it is encoded to ef bf bf (three bytes).

The worst case for two UTF-16 words (a "surrogate pair") is U+10FFFF, which in UTF-16 is encoded as 0xDBFF 0xDFFF. In UTF-8 it is encoded to f4 8f bf bf (four bytes).

Therefore the worst case is a load of U+FFFF's, which converts a UTF-16 string of N code units (2N bytes) into a UTF-8 string of 3N bytes.

So yes, you are correct. I don't think you need to consider stuff like glyphs because that sort of thing is done after decoding from UTF8/16 to code points.
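
For example, a minimal C# sketch (using the standard System.Text APIs) to dump the encoded bytes and check both cases:

using System;
using System.Text;

// U+FFFF: one UTF-16 code unit -> three UTF-8 bytes.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\uFFFF")));      // EF-BF-BF

// U+10FFFF: the surrogate pair 0xDBFF 0xDFFF (two code units) -> four UTF-8 bytes.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\U0010FFFF")));  // F4-8F-BF-BF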

Nikita Volkov
Timmmm

Properly formed UTF-8 can be up to 4 bytes per Unicode codepoint.

UTF-16-encoded characters can take up to two 16-bit code units per Unicode codepoint.

Characters outside the basic multilingual plane (including emoji and languages that were added to more recent versions of Unicode) are represented with up to 21 bits, which in UTF-8 results in 4-byte sequences; they also take up 4 bytes (two code units) in UTF-16.

However, there are some environments that do things weirdly. Since UTF-16 characters outside the basic multilingual plane take two 16-bit code units (they're detectable because they always fall in the range U+D800 to U+DFFF), some mistaken UTF-8 implementations, usually referred to as CESU-8, convert those surrogate pairs into two 3-byte UTF-8 sequences, for a total of six bytes per codepoint. (I believe some early Oracle DB implementations did this, and I'm sure they weren't the only ones.)

There's one more minor wrench in things, which is that some glyphs are classified as combining characters, and multiple UTF-16 (or UTF-32) sequences are used when determining what gets displayed on the screen, but I don't think that applies in your case.

Based on your edit, it looks like you're trying to estimate the maximum size of a .NET encoding conversion. String.Length measures the total number of Chars, which are UTF-16 code units. As a worst-case estimate, therefore, I believe you can safely use count(Char) * 3, because non-BMP characters count as two Chars yet encode to only 4 bytes of UTF-8.
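
A minimal sketch (my own illustration, using the standard System.Text APIs) of why the estimate is safe: a non-BMP character adds 2 to Length but only 4 bytes to the UTF-8 output, so the * 3 estimate over-allocates rather than under-allocates:

using System;
using System.Text;

string s = "\u20AC\U0001F600";                 // U+20AC (1 char, 3 UTF-8 bytes) + U+1F600 (2 chars, 4 UTF-8 bytes)
int estimate = s.Length * 3;                   // 3 chars * 3 = 9
int actual = Encoding.UTF8.GetByteCount(s);    // 3 + 4 = 7

Console.WriteLine($"{actual} <= {estimate}");  // prints "7 <= 9"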

If you want to get the total number of UTF-32 codepoints represented, you should be able to do something like

var maximumUtf8Bytes = new System.Globalization.StringInfo(myString).LengthInTextElements * 4;

(My C# is a bit rusty as I haven't used a .Net environment much in the last few years, but I think that does the trick).

JasonTrue
  • The UTF-8 expanding system can technically go up to 6 bytes, IIRC, but I don't think any of the specs I found on it were very specific on an official upper limit. It was more of a "we're only using up to 4 so far" kind of deal. – Nyerguds Mar 08 '19 at 08:48
  • @Nyerguds in the early Unicode days (Unicode 2 or 3) I think the early canonical implementations allowed utf-8 to theoretically expand to 6 bytes if Unicode ever went past 21 bits of assignable codepoints, but it was later formally restricted to 4, presumably because the Unicode standard settled on 21 bits. See also https://stackoverflow.com/questions/9533258/what-is-the-maximum-number-of-bytes-for-a-utf-8-encoded-character – JasonTrue Mar 08 '19 at 09:54
  • I found that emoji can take 4 UTF-16 codepoints: 2 to encode the emoji itself and 2 to encode the skin color; for example, this emoji requires 4 codepoints in UTF-16. So I'm interested in whether these sequences can be longer. How big a temporary buffer should I allocate to hold a single symbol? Is unichar tmp[4] enough? – dmitry1100 Jul 31 '23 at 16:55
  • @dmitry1100 Emoji are a complex world but the ones in the surrogate-pair range are still based on the 21-bit assignable range, so you can do the conventional range detection to identify if a character is part of a UTF-16 surrogate pair or not (i.e. U+D800 to U+DBFF) or `char.IsSurrogate(c)`. The UTF-16 emoji itself will result in two 16-bit spans, which in sequence will combine to a single four-byte UTF-8 sequence. The skin color will be another 2 UTF-16 spans -> another 4 byte UTF-8 sequence. https://www.compart.com/en/unicode/U+1F483 The skin color works like combining diacritics do – JasonTrue Aug 01 '23 at 03:20
  • The skin tone specification works fairly similarly to the idea of combining diacritics (i.e. the decomposition of é into e and the diacritic mark above it). It's essentially a distinct UTF32 codepoint, but the sequence is combined. There's some discussion about how combinations should work with emoji here: https://unicode-org.github.io/unicode-reports/tr52/tr52.html – JasonTrue Aug 01 '23 at 03:23
  • @JasonTrue Thank you for the clarification. I was testing NSString - (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index; to reverse a string. When it meets a complex emoji with skin color it returns a range to fetch all 4 UTF-16 characters. So, I'm wondering how long such linked sequences can be. – dmitry1100 Aug 01 '23 at 20:48
  • @dmitry1100 I don't believe there's a future-proof answer to that, especially if you consider complications like ZWJ Emoji: https://emojipedia.org/emoji-zwj-sequence – JasonTrue Aug 28 '23 at 06:46