
In a related question about Unicode handling in .NET, Jon Skeet stated:

> If you're happy ignoring surrogate pairs, UTF-16 has some nice properties, basically due to the size per code unit being constant. You know how much space to allocate for a given number of code units…

But how do you know what the code unit size is, or even whether an encoding uses a variable number of code units per code point?

At first I thought this could easily be determined by calling the GetMaxCharCount(nBytes) and GetMaxByteCount(nChars) functions of the System.Text.Encoding instance in question. For example, given 8 input bytes, we should get no more than 8, 4, and 2 decoded characters for ASCII / UTF-8, UTF-16 / UCS-2 and UTF-32 / UCS-4, respectively; and given 8 input characters, we should get exactly 8 bytes for ASCII but different numbers for the other encodings, which would reveal whether their character size is constant or variable. However, those functions return hardly useful results:

         GetMaxCharCount(8)   GetMaxByteCount(8)
------------------------------------------------
ASCII         8 chars              9 bytes   <--- Leftover chars in ASCII? O_o
UTF-8         9 chars             27 bytes
UTF-16        5 chars             18 bytes
UTF-32        6 chars             36 bytes   <--- More chars than UTF-16? O_o
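
For reference, the table above corresponds to calls like the following (a minimal sketch; the exact numbers may differ between .NET versions):

    using System;
    using System.Text;

    class MaxCountDemo
    {
        static void Main()
        {
            Encoding[] encodings = { Encoding.ASCII, Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 };
            string[] names = { "ASCII", "UTF-8", "UTF-16", "UTF-32" };

            for (int i = 0; i < encodings.Length; i++)
            {
                // Worst-case counts for 8 input bytes / 8 input characters.
                Console.WriteLine("{0,-7} GetMaxCharCount(8) = {1,2}   GetMaxByteCount(8) = {2,2}",
                    names[i], encodings[i].GetMaxCharCount(8), encodings[i].GetMaxByteCount(8));
            }
        }
    }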

This behavior is intentional, though, as their documentation clearly says:

> Note that GetMaxCharCount considers the worst case for leftover bytes from a previous encoder operation. For most code pages, passing a value of 0 to this method retrieves values greater than or equal to 1. GetMaxCharCount(N) is not necessarily the same value as N * GetMaxCharCount(1).

> Note that GetMaxByteCount considers potential leftover surrogates from a previous decoder operation. Because of the decoder, passing a value of 1 to the method retrieves 2 for a single-byte encoding, such as ASCII. You should use the IsSingleByte property if this information is necessary. GetMaxByteCount(N) is not necessarily the same value as N * GetMaxByteCount(1).

What is not so clear is how those (or other?) functions can be applied to the task of determining the code unit size dynamically, rather than from a hardcoded lookup table covering a limited set of encodings. The only viable rule I found is “if IsSingleByte, then the unit size is 1 byte and the character size is constant”, but if the question were only about single-byte encodings, it would not need asking at all. So what is the general solution for arbitrary encodings?
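
That partial rule amounts to something like the following hypothetical helper, which also shows why it is not a general answer:

    using System.Text;

    static class EncodingInfoHelper
    {
        // Hypothetical helper: captures only the "IsSingleByte => 1 byte per code unit"
        // rule described above; it says nothing about multi-byte encodings.
        public static int? TryGetCodeUnitSize(Encoding enc)
        {
            return enc.IsSingleByte ? (int?)1 : null;
        }
    }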

  • For `IsSingleByte=False` encodings, the only way I can think of is to loop through every Unicode code point (or maybe just a sampling of the more commonly used code points) and encode those using `Encoding.GetBytes()` and then analyze the bytes. For most encodings, the encoded forms of the ASCII characters U+0000 - U+007F will tell you the code unit size (EBCDIC encodings might be weird), and then you can see whether the non-ASCII code points encode to larger multiples of that size. – Remy Lebeau Jul 10 '15 at 23:51
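
A sketch of the probing approach suggested in that comment might look like this (hypothetical helper; it assumes every encoded character occupies a whole number of code units, which holds for the UTF encodings but can fail for stateful or escape-based encodings such as ISO-2022):

    using System;
    using System.Linq;
    using System.Text;

    static class CodeUnitProbe
    {
        // Hypothetical sketch: guess the code unit size of an arbitrary encoding by
        // encoding a sample of code points and taking the GCD of their byte lengths.
        public static int GuessCodeUnitSize(Encoding enc)
        {
            if (enc.IsSingleByte)
                return 1; // trivially one byte per code unit

            // Printable ASCII plus a few code points of increasing "width".
            int[] samples = Enumerable.Range(0x20, 0x5F)                 // U+0020 .. U+007E
                .Concat(new[] { 0x00E9, 0x0416, 0x4E2D, 0x1F600 })       // é, Ж, 中, 😀
                .ToArray();

            int gcd = 0;
            foreach (int cp in samples)
            {
                string s = char.ConvertFromUtf32(cp);
                int len = enc.GetBytes(s).Length;
                if (len == 0)
                    continue; // code point not representable in this encoding
                gcd = Gcd(gcd, len);
            }
            return gcd == 0 ? 1 : gcd;
        }

        private static int Gcd(int a, int b)
        {
            return b == 0 ? a : Gcd(b, a % b);
        }
    }

Under these assumptions, GuessCodeUnitSize(Encoding.Unicode) should return 2, GuessCodeUnitSize(Encoding.UTF32) 4, and GuessCodeUnitSize(Encoding.UTF8) 1; whether a code point can take more than one code unit then follows from comparing the individual encoded lengths against that size.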

0 Answers