
I need to read a string from a sequence of UTF-8 bytes. The bytes arrive in separate read operations that do not respect character boundaries, so I cannot simply call System.Text.Encoding.UTF8.GetString on each chunk. However, the System.Text.Decoder class, as returned by System.Text.Encoding.UTF8.GetDecoder(), appears to be designed for this scenario. One of the out arguments of its Convert method, completed, looks like it should indicate when a character has only been partially read.

The documentation for Convert (at https://msdn.microsoft.com/en-us/library/h6w985hz(v=vs.110).aspx) suggests that completed should be false if either the output (char[]) buffer was too small or not all the bytes could be converted. See Remarks, paragraph 4.

However, completed is true even though the bytes of a character have not been completely converted, which is exactly when the docs say it should be false.

I presume I'm doing something wrong (or is this a bug?). If so, how can I detect that my byte stream is paused in the middle of a character?

Demonstration code:

const int outSize = 10;
char[] outBuf = new char[outSize];
byte[] frag1 = new byte[] { 0xE7 };
byte[] frag2 = new byte[] { 0x95, 0xA2 };
var decoder = System.Text.Encoding.UTF8.GetDecoder();
int bytesUsed, charsUsed; bool completed;

// the first byte of the UTF-8 character
decoder.Convert(frag1, 0, frag1.Length, outBuf, 0, outSize, false, out bytesUsed, out charsUsed, out completed);
Debug.Assert( bytesUsed == 1 );
Debug.Assert( charsUsed == 0 );

// // // // // // // // // // // //  completed is true here, but WHY ?
Debug.Assert( ! completed);
// // // // // // // // // // // // 

// the second and third bytes of the UTF-8 character
decoder.Convert(frag2, 0, frag2.Length, outBuf, 0, outSize, false, out bytesUsed, out charsUsed, out completed);
Debug.Assert(bytesUsed == 2);
Debug.Assert(charsUsed == 1);
Debug.Assert(completed);
Debug.Assert( new String(outBuf, 0, 1 ) == "畢" );

Thanks!

William
  • The comment in the [source code](https://referencesource.microsoft.com/#mscorlib/system/text/decodernls.cs,f7d6515ff5dfceae) reads, "*Its completed if they've used what they wanted AND if they didn't want flush or if we are flushed*", `completed = (bytesUsed == byteCount) && (!flush || !this.HasState) && (m_fallbackBuffer == null || m_fallbackBuffer.Remaining == 0)`. So you're getting `true` because you're passing `false` for `bool flush`. – GSerg Mar 27 '18 at 19:31
  • Effectively you want to know when [Utf8Decoder.HasState](https://referencesource.microsoft.com/#mscorlib/system/text/utf8encoding.cs,cc89a97838d78451) is `true`. Unfortunately, it's `internal`, and as you can see it does not participate in calculating `bool completed`. Try to [call it via reflection](https://stackoverflow.com/q/135443/11683)? – GSerg Mar 27 '18 at 19:52
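
One possible way to get at the same information without reflection is sketched below. It is not taken from the comments above and relies on two assumptions: that `Decoder.GetCharCount` with `flush: true` only simulates a flush and leaves the decoder's state untouched (which is how I read its documentation), and that the decoder uses the default replacement fallback, as the one returned by `Encoding.UTF8.GetDecoder()` does. Under those assumptions, an empty input should yield a non-zero count exactly when the decoder is still holding bytes of an incomplete character. `Utf8DecoderProbe` and `HasPendingBytes` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text;

static class Utf8DecoderProbe
{
    // Hypothetical helper, not a framework API.
    // Assumption: GetCharCount with flush: true only simulates a flush,
    // so the decoder's buffered bytes are not discarded by this call.
    // Assumption: the decoder uses a replacement fallback, so buffered
    // bytes of an incomplete character count as at least one char.
    public static bool HasPendingBytes(Decoder decoder)
    {
        return decoder.GetCharCount(new byte[0], 0, 0, true) > 0;
    }
}

static class Program
{
    static void Main()
    {
        char[] outBuf = new char[10];
        byte[] frag1 = { 0xE7 };        // first byte of a 3-byte sequence
        byte[] frag2 = { 0x95, 0xA2 };  // remaining two bytes
        Decoder decoder = Encoding.UTF8.GetDecoder();
        int bytesUsed, charsUsed;
        bool completed;

        decoder.Convert(frag1, 0, frag1.Length, outBuf, 0, outBuf.Length,
                        false, out bytesUsed, out charsUsed, out completed);
        // Probe after a partial character has been fed in.
        Console.WriteLine(Utf8DecoderProbe.HasPendingBytes(decoder));

        decoder.Convert(frag2, 0, frag2.Length, outBuf, 0, outBuf.Length,
                        false, out bytesUsed, out charsUsed, out completed);
        // Probe again once the character has been completed.
        Console.WriteLine(Utf8DecoderProbe.HasPendingBytes(decoder));
    }
}
```

If the decoder were configured with an exception fallback instead, the probe would presumably throw rather than return a count, so it would need a try/catch in that case.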
