What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length
in C# / .NET)?
I see 3 possibilities:
1. # of UTF-16 code units x 2
2. # of UTF-16 code units x 3
3. # of UTF-16 code units x 4
A Unicode code point is represented in UTF-16 by either 1 or 2 code units, so we only need to consider the worst-case scenario of a string filled entirely with one kind or the other. If a UTF-16 string is composed entirely of 2-code-unit code points (surrogate pairs), then the UTF-8 representation will be at most the same size: those code points take exactly 4 bytes in both encodings, which works out to 2 UTF-8 bytes per UTF-16 code unit, i.e. option (1) above.
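To make that concrete, here's a quick check using .NET's Encoding classes (U+1F600 is just an arbitrary supplementary-plane character picked for illustration):

using System;
using System.Text;

// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
string s = "\U0001F600";
Console.WriteLine(s.Length);                         // 2 code units
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 4 bytes in UTF-16 (Encoding.Unicode is UTF-16LE)
Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 4 bytes in UTF-8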
So the interesting case to consider, which I don't know the answer to, is the maximum number of bytes that a single-code-unit UTF-16 code point can require in its UTF-8 representation.
If every single-code-unit UTF-16 code point can be represented in at most 3 UTF-8 bytes, which my gut tells me makes the most sense, then option (2) is the worst-case scenario. If any of them require 4 bytes, then option (3) is the answer.
Does anyone have insight into which is correct? I'm really hoping for (1) or (2), as (3) is going to make things a lot harder :/
UPDATE
From what I can gather, UTF-16 encodes every character in the BMP as a single code unit, and code points in all the other planes as 2 code units (a surrogate pair).
It seems that UTF-8 can encode the entire BMP in at most 3 bytes per code point, and uses 4 bytes per code point for the other planes.
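That's easy to verify empirically; here's a brute-force sketch that walks every code point UTF-16 can hold in a single code unit (the BMP minus the surrogate range) and records the largest UTF-8 byte count it sees:

using System;
using System.Text;

int max = 0;
for (int cp = 0; cp <= 0xFFFF; cp++)
{
    // Skip U+D800 through U+DFFF: surrogate code units never stand alone in well-formed UTF-16.
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;
    int bytes = Encoding.UTF8.GetByteCount(((char)cp).ToString());
    if (bytes > max) max = bytes;
}
Console.WriteLine(max); // prints 3: the entire BMP fits in at most 3 UTF-8 bytes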
Thus it seems to me that option (2) above is the correct answer, and this should work:
string str = "Some string";
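// Worst case: every UTF-16 code unit expands to a 3-byte UTF-8 sequence.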
int maxUtf8EncodedSize = str.Length * 3;
Does that seem like it checks out?
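For what it's worth, here's a quick sanity check against the framework's own encoder (the sample strings are arbitrary). Note that .NET also ships Encoding.UTF8.GetMaxByteCount for exactly this purpose; it returns a slightly larger bound, (charCount + 1) * 3 in current implementations, because it also budgets for a leftover surrogate held in encoder state:

using System;
using System.Text;

// Samples covering 1-, 2-, 3-, and 4-byte UTF-8 sequences.
string[] samples = { "Some string", "h\u00E9llo", "\u20AC\u20AC\u20AC", "\U0001F600\U0001F601" };

foreach (string s in samples)
{
    int actual = Encoding.UTF8.GetByteCount(s);
    int bound = s.Length * 3;
    Console.WriteLine($"len={s.Length} utf8={actual} bound={bound} holds={actual <= bound}");
}

// The framework's built-in worst-case estimate, for comparison.
Console.WriteLine(Encoding.UTF8.GetMaxByteCount("Some string".Length)); // 36, i.e. (11 + 1) * 3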