In a related question about Unicode handling in .NET, Jon Skeet stated:
> If you're happy ignoring surrogate pairs, UTF-16 has some nice properties, basically due to the size per code unit being constant. You know how much space to allocate for a given number of code units…
But how do you know what the code unit size is, or even whether an encoding uses a variable number of code units per code point?
At first I thought it could easily be determined by calling the `GetMaxCharCount(nBytes)` and `GetMaxByteCount(nChars)` methods of the `System.Text.Encoding` instance in question. For example, given 8 input bytes, we should get at most 8, 4, and 2 decoded characters for ASCII / UTF-8, UTF-16 / UCS-2, and UTF-32 / UCS-4, respectively; and given 8 input characters, we should get 8 bytes for ASCII and different numbers for the other encodings, which would reveal whether their code unit size is constant or variable. However, these methods return hardly useful results:
```
         MaxChars    MaxBytes
         (8 bytes)   (8 chars)
-----------------------------------
ASCII    8 chars     9 bytes    <--- Leftover chars in ASCII? O_o
UTF-8    9 chars     27 bytes
UTF-16   5 chars     18 bytes
UTF-32   6 chars     36 bytes   <--- More chars than UTF-16? O_o
```
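These numbers come straight from the built-in encodings; a minimal snippet to reproduce the table above:

```csharp
using System;
using System.Text;

class MaxCountTable
{
    static void Main()
    {
        var encodings = new (string Name, Encoding Enc)[]
        {
            ("ASCII",  Encoding.ASCII),
            ("UTF-8",  Encoding.UTF8),
            ("UTF-16", Encoding.Unicode),
            ("UTF-32", Encoding.UTF32),
        };

        Console.WriteLine("         MaxChars(8 bytes)  MaxBytes(8 chars)");
        foreach (var (name, enc) in encodings)
        {
            // Worst-case chars decoded from 8 bytes, and
            // worst-case bytes encoded from 8 chars.
            Console.WriteLine($"{name,-8} {enc.GetMaxCharCount(8),8} {enc.GetMaxByteCount(8),12}");
        }
    }
}
```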
This behavior is intentional, though, as their documentation clearly says:
> Note that `GetMaxCharCount` considers the worst case for leftover bytes from a previous encoder operation. For most code pages, passing a value of 0 to this method retrieves values greater than or equal to 1. `GetMaxCharCount(N)` is not necessarily the same value as `N * GetMaxCharCount(1)`.

> Note that `GetMaxByteCount` considers potential leftover surrogates from a previous decoder operation. Because of the decoder, passing a value of 1 to the method retrieves 2 for a single-byte encoding, such as ASCII. You should use the `IsSingleByte` property if this information is necessary. `GetMaxByteCount(N)` is not necessarily the same value as `N * GetMaxByteCount(1)`.
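The documented leftover padding is easy to observe in isolation, using only the calls quoted above:

```csharp
using System;
using System.Text;

class LeftoverDemo
{
    static void Main()
    {
        // Even for single-byte ASCII, GetMaxByteCount(1) returns 2,
        // reserving room for a potential leftover surrogate.
        Console.WriteLine(Encoding.ASCII.GetMaxByteCount(1));      // 2
        Console.WriteLine(Encoding.ASCII.IsSingleByte);            // True

        // And GetMaxByteCount(N) != N * GetMaxByteCount(1):
        Console.WriteLine(Encoding.ASCII.GetMaxByteCount(8));      // 9
        Console.WriteLine(8 * Encoding.ASCII.GetMaxByteCount(1));  // 16
    }
}
```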
What is not so clear is how those (or other?) functions can be applied to the task of determining code unit size dynamically, rather than from a hardcoded lookup table for a limited set of encodings. The only viable rule I found is "if `IsSingleByte`, then the unit size is 1 byte and the character size is constant", but if the problem were limited to single-byte encodings, no detection would be needed at all. So what is the general solution for arbitrary encodings?
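For completeness, the only probing approach I could sketch beyond `IsSingleByte` is to encode a few known code points and compare their byte counts. `GuessCodeUnitSize` and `GuessIsVariableWidth` below are hypothetical helpers, and the sketch assumes a full Unicode encoding; for encodings that cannot represent the probe characters (such as ASCII), fallback replacement distorts the counts:

```csharp
using System;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper: "A" (U+0041) occupies exactly one code unit in
    // ASCII, UTF-8, UTF-16 and UTF-32, so its encoded length is the unit size.
    public static int GuessCodeUnitSize(Encoding enc) => enc.GetByteCount("A");

    // Hypothetical helper: an encoding is variable-width if some code point
    // needs more than one code unit. U+20AC (€) is 3 code units in UTF-8;
    // U+10348 is a surrogate pair (2 code units) in UTF-16; both are single
    // code units in UTF-32. Caveat: only meaningful for encodings that can
    // actually represent these characters.
    public static bool GuessIsVariableWidth(Encoding enc)
    {
        int unit = GuessCodeUnitSize(enc);
        return enc.GetByteCount("\u20AC") > unit
            || enc.GetByteCount("\U00010348") > unit;
    }

    static void Main()
    {
        foreach (var enc in new[] { Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 })
            Console.WriteLine($"{enc.WebName}: unit = {GuessCodeUnitSize(enc)} byte(s), " +
                              $"variable-width = {GuessIsVariableWidth(enc)}");
    }
}
```

Even where this works for the common Unicode encodings, it feels like a lookup table in disguise, which is why I am asking for a general solution.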