Implementing DbDataReader.GetChars() efficiently when underlying data is not UTF-16

Question

I need to implement DbDataReader.GetChars() for an ADO.NET provider, with the caveat that the data in the cell may not be UTF-16; in fact, it may be in any one of a number of different encodings.

The implementation is specifically for 'long data', and the source data is on the server. The interface I have to the server (which cannot realistically be changed) is to request a range of bytes for the cell. The server does not interpret these bytes in any way; to the server they are simply binary data.

I can special-case UTF-16LE and UTF-16BE with obvious implementations, but for other encodings, there is no direct way to translate the request "get me UTF-16 codeunits X to X + Y" into the request "get me bytes X' to X' + Y' in encoding Z".

Some 'requirements' that eliminate obvious implementations:

I do not wish to retrieve all data for a given cell to the client at any one time unless it is necessary. The cells may be very large, and an application asking for a few kilobytes shouldn't force hundreds of megabytes of memory to be allocated to satisfy the request.
I wish to support the random access exposed by GetChars() relatively efficiently. If the first request asks for codeunits 1 billion to 1 billion + 10, I don't see any way of avoiding retrieving all data in the cell from the server up to the requested codeunits, but subsequently asking for codeunits 1 billion + 10 to 1 billion + 20, or even codeunits 999,999,000 to 1 billion, should not imply re-retrieving all that data.

I'm guessing that the great majority of applications won't actually access long-data cells 'randomly', but it would be nice to avoid horrible performance if one did, so if I can't find a relatively easy way to support it, I suppose I'll have to give it up.

My idea was to keep a mapping of #{UTF-16 code units} -> #{bytes of data in server encoding}, updating it as I retrieved data from the cell, and using it to find a 'close' place to start requesting data from the server, rather than retrieving from the beginning every time. (On a side note, the lack of something similar to C++'s std::map::lower_bound in the .NET Framework frustrates me quite a bit.) Unfortunately, I found it very difficult to generate this mapping!
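The lookup half of that mapping can be sketched with two parallel sorted lists and List<T>.BinarySearch, which returns the bitwise complement of the insertion point when the key is absent. That is enough to find the closest checkpoint at or before a requested offset. The class and member names below (OffsetCheckpoints, Add, FindStart) are hypothetical, and this is only a sketch of the idea, not a full implementation:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: remembers (UTF-16 code unit offset -> server byte offset)
// pairs and finds the closest known pair at or before a requested offset.
public sealed class OffsetCheckpoints
{
    // Parallel lists kept sorted by code unit offset. (0, 0) is always valid.
    private readonly List<long> charOffsets = new List<long> { 0 };
    private readonly List<long> byteOffsets = new List<long> { 0 };

    // Records that 'charOffset' code units correspond to 'byteOffset' bytes.
    public void Add(long charOffset, long byteOffset)
    {
        int i = charOffsets.BinarySearch(charOffset);
        if (i >= 0) return;              // already recorded
        i = ~i;                          // insertion point that keeps the list sorted
        charOffsets.Insert(i, charOffset);
        byteOffsets.Insert(i, byteOffset);
    }

    // Finds the checkpoint with the largest code unit offset <= charOffset.
    public void FindStart(long charOffset, out long startChar, out long startByte)
    {
        int i = charOffsets.BinarySearch(charOffset);
        if (i < 0) i = ~i - 1;           // ~i is the first larger entry; step back one
        startChar = charOffsets[i];
        startByte = byteOffsets[i];
    }
}
```

With checkpoints (100, 150) and (200, 310) recorded, a request starting at code unit 150 would resume decoding from byte 150 (code unit 100) instead of from the beginning of the cell. Inserting into a List is O(n), which is probably acceptable if checkpoints are sparse.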

I've been trying to use the Decoder class, specifically Decoder.Convert(), to convert the data piecemeal, but I can't figure out how to reliably tell that a given number of bytes of the source data maps to exactly X UTF-16 codeunits, as the 'bytesUsed' parameter seems to include source bytes which were just stashed into the object's internal state, and not output as chars. This causes problems when I try to decode starting from, or ending in, the middle of a codepoint: the result is garbage.
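The behaviour can be reproduced with a two-byte UTF-8 sequence split across two calls. The wrapper below (PartialSequenceDemo.Feed is a hypothetical name) just forwards to Decoder.Convert() with flush set to false and reports what came back:

```csharp
using System;
using System.Text;

public static class PartialSequenceDemo
{
    // Decodes bytes[offset .. offset + count) without flushing and reports how many
    // bytes the decoder consumed and how many chars it actually produced.
    public static void Feed(Decoder decoder, byte[] bytes, int offset, int count,
                            out int bytesUsed, out int charsUsed)
    {
        char[] chars = new char[16];     // scratch output; large enough for this demo
        bool completed;
        decoder.Convert(bytes, offset, count, chars, 0, chars.Length,
                        false, out bytesUsed, out charsUsed, out completed);
    }
}
```

Feeding the UTF-8 bytes of "\u00E9" (0xC3 0xA9) one at a time through a single Encoding.UTF8.GetDecoder() instance gives bytesUsed == 1 and charsUsed == 0 for the first byte, then bytesUsed == 1 and charsUsed == 1 for the second. The first byte is counted as used even though no char was output for it, so bytesUsed alone does not say where the last complete character ended.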

So, my question is, is there some 'trick' I can use to accomplish this (figuring out the exact mapping of #bytes to #codeunits, without resorting to something like converting in a loop, decreasing the size of the source byte-by-byte)?

score 1 · Answer 1 · edited May 23 '17 at 12:29

Do you know which encodings may be supplied by your server? I ask because some encodings are "stateful", which means that the meaning of a given byte may depend on the sequence of bytes that precede it. For instance (source), in the encoding standard ISO 2022-JP, the two bytes 0x24 0x2c may mean the Japanese Hiragana character 'GA' (が) or the two ASCII characters '$' and ',' according to the "shift state", that is, the presence of a preceding control sequence. In several pre-Unicode "Shift-JIS" Japanese encodings, these shift states can appear anywhere in the string and will apply to all subsequent characters until a new shift control sequence is encountered. In the worst case, according to this site, "Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning".

Even the UTF-16 encoding used by C#, which is notionally stateless, is more complicated than is generally realized due to the presence of surrogate pairs and combining characters. Surrogate pairs are pairs of chars that together specify a single character outside the Basic Multilingual Plane; these are required because there are more than ushort.MaxValue Unicode code points. Combining characters are sequences of diacritical marks applied to preceding characters, such as in the string "Ĥ=T̂+V̂". And of course these can coexist, albeit unbeautifully, which means that a single abstract UTF-16 "text element" can be made up of one or two "base" chars plus some number of diacriticals or other combining characters. All of these make up just one single character from the point of view of the user, and so should never be split or orphaned.
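To make the difference between chars and text elements concrete, here is a small sketch using the public StringInfo class. The wrapper names (TextElementCounts, CodeUnits, TextElements) are hypothetical:

```csharp
using System;
using System.Globalization;

public static class TextElementCounts
{
    // Number of UTF-16 code units (chars), which is what GetChars() offsets count.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of user-perceived text elements, as StringInfo sees them.
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For "H\u0302" (H followed by a combining circumflex) this reports 2 code units but 1 text element, and the same holds for the surrogate pair "\uD83D\uDE00".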

So the general algorithm would be, when you want to fetch N characters from the server starting at offset K, to fetch N+E characters starting at K-E for some "large enough" E, then scan backwards until the first text element boundary is found. Sadly, for UTF-16, Microsoft doesn't provide an API to do this directly; one would need to reverse-engineer their method

internal static int GetCurrentTextElementLen(String str, int index, int len, ref UnicodeCategory ucCurrent, ref int currentCharCount)

in StringInfo.cs.

A bit of a nuisance, but doable.
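As an alternative to reverse-engineering the internal method, a forward parse of the fetched window with the public StringInfo.ParseCombiningCharacters() may be good enough. It returns the start index of every text element, so the boundary at or before a given index can be found with a binary search. This is a sketch under the assumption that the window itself starts on a text element boundary (or that E is large enough for a bad start not to matter); TextElementBoundary.SnapBack is a hypothetical name. Also note that what counts as a text element is not identical across .NET versions.

```csharp
using System;
using System.Globalization;

public static class TextElementBoundary
{
    // Returns the largest text element start index that is <= index within 'window'.
    // Assumes 'window' begins on a text element boundary.
    public static int SnapBack(string window, int index)
    {
        if (index <= 0) return 0;
        if (index >= window.Length) return window.Length;

        // Start indices of all text elements, in ascending order; starts[0] is 0.
        int[] starts = StringInfo.ParseCombiningCharacters(window);

        int pos = Array.BinarySearch(starts, index);
        if (pos < 0) pos = ~pos - 1;     // not an exact start: take the one before
        return starts[pos];
    }
}
```

For the window "e\u0301x", index 1 falls between the base letter and its combining accent, so it snaps back to 0, while index 2 is already a boundary and is returned unchanged.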

For other, stateful, encodings, I would not know how to do this, and the logic of scanning backwards to find the first character boundary would be specific to each encoding. For encodings like those in the Shift-JIS family you may well need to scan back arbitrarily far.

Not really an answer but way too long for a comment.

Update

You might try your algorithm for all single-byte encodings. There are 95 such encodings on my computer:

        // Requires: using System.Linq; using System.Text;
        // Enumerate every encoding the runtime knows about and keep the single-byte ones.
        var singleByteEncodings = Encoding.GetEncodings().Where(enc => enc.GetEncoding().IsSingleByte).ToList();  // 95 found.
        var singleByteEncodingNames = singleByteEncodings.Select(enc => enc.DisplayName).ToList();  // 95 names displayed.
        bool latin1IsSingleByte = Encoding.GetEncoding("iso-8859-1").IsSingleByte;  // returns true.

This might be useful in practice because a lot of older databases only support single-byte character encodings, or do not have multibyte characters enabled for their tables. The default character encoding for a SQL Server database is iso_1 a.k.a ISO 8859-1, for instance. But see this caution from a Microsoft blogger:

Use IsSingleByte() to try to figure out if an encoding is a single byte code page, however I'd really recommend that you don't make too many assumptions about encodings. Code that assumes a 1 to 1 relationship and then tries to seek or back up or something is likely to get confused, encodings aren't conducive to that kind of behavior. Fallbacks, decoders and encoders can change the byte count behavior for individual calls and encodings can sometimes do unexpected things.
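Bearing that caution in mind, a fast path for fixed-width encodings might look something like the sketch below. It treats single-byte encodings as one byte per UTF-16 code unit and UTF-16LE/BE (code pages 1200 and 1201) as two, and reports failure for everything else so the caller can fall back to sequential decoding. FixedWidthRanges and its members are hypothetical names, and the one-to-one assumption for single-byte encodings is exactly the kind of assumption the quote warns about, so it would need testing against the encodings and fallbacks actually in use.

```csharp
using System;
using System.Text;

public static class FixedWidthRanges
{
    // Bytes per UTF-16 code unit when that ratio is assumed constant, otherwise 0.
    public static int BytesPerCodeUnit(Encoding encoding)
    {
        if (encoding.IsSingleByte) return 1;
        if (encoding.CodePage == 1200 || encoding.CodePage == 1201) return 2; // UTF-16LE / UTF-16BE
        return 0;                        // variable width or stateful: no direct mapping
    }

    // Translates a code unit range into a byte range for fixed-width encodings.
    // Returns false when the encoding has no constant ratio.
    public static bool TryMapRange(Encoding encoding, long charOffset, int charCount,
                                   out long byteOffset, out long byteCount)
    {
        int width = BytesPerCodeUnit(encoding);
        byteOffset = charOffset * width;
        byteCount = (long)charCount * width;
        return width != 0;
    }
}
```

For example, code units 1000 to 1010 map to bytes 1000 to 1010 for ISO 8859-1 and to bytes 2000 to 2020 for UTF-16, while UTF-8 is rejected.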

score 0 · Accepted Answer · answered Aug 13 '14 at 21:25

I figured out how to deal with potentially losing conversion state: I keep a copy of the Decoder around in my mapping to use when restarting from the associated offset. This way I don't lose any partial codepoints it was keeping around in its internal buffers. This also lets me avoid adding encoding-specific code, and deals with potential problems with encodings such as Shift-JIS that dbc brought up.

Decoder is not cloneable, so I use serialization + deserialization to make the copy.

answered Aug 13 '14 at 21:25

Bwmat

Turns out there is what seems to be a bug in the UTF-32 decoder: serializing it and deserializing it seems to clear its internal state (contrary to what the documentation says). sigh... – Bwmat Aug 28 '14 at 20:35