What am I supposed to do in .NET with a UTF8 Encoded string?

Question

I am using Google Chrome Native Messaging, whose documentation (found here) says that it supplies UTF-8 encoded JSON.

I am pretty sure my code is fairly standard and pretty much a copy from answers here in C#. For example see this SO question.

Private Function OpenStandardStreamIn() As String
    Dim MsgLength As Integer = 0
    Dim InputData As String = ""
    Dim LenBytes As Byte() = New Byte(3) {} 'first 4 bytes are length

    Dim StdIn As System.IO.Stream = Console.OpenStandardInput() 'open the stream
    StdIn.Read(LenBytes, 0, 4) 'length
    MsgLength = System.BitConverter.ToInt32(LenBytes, 0) 'convert length to Int

    Dim Buffer As Char() = New Char(MsgLength - 1) {} 'create Char array for remaining bytes

    Using Reader As System.IO.StreamReader = New System.IO.StreamReader(StdIn) 'Using to auto dispose of stream reader
        While Reader.Peek() >= 0 'while the next byte is not Null
            Reader.Read(Buffer, 0, Buffer.Length) 'add to the buffer
        End While
    End Using

    InputData = New String(Buffer) 'convert buffer to string

    Return InputData
End Function

The problem I have is that when the JSON includes characters such as ß Ü Ö Ä, the whole string seems to be different and I cannot deserialize it. It is readable and my log shows the string is fine, but there is something different. As long as the string does NOT include these characters, deserialization works fine. I am not supplying the JavaScriptSerializer code as this is not the problem.

I have tried creating the StreamReader with different Encodings such as

New System.IO.StreamReader(StdIn, Encoding.GetEncoding("iso-8859-1"), True)

however the ß Ä etc are then not correct.

What I don't understand is if the string is UTF8 and .NET uses UTF16 how am I supposed to make sure the conversion is done properly?

UPDATE

Been doing some testing. What I have found is if I receive a string with fuß then the message length (provided by native messaging) is 4 but number of Char in the buffer is 3, if the string is fus then the message length is 3 and number of characters is 3. Why is that?

With the above code the Buffer object is 1 too big, which is why there is a problem. If I simply use the Read method on the stream then it works fine. It appears that Google Messaging is sending a message length that is different when the ß is in the string.

If I want to use the above code then how can I know that the message length is not right?

If the string is in UTF8 then it's a binary blob and not a proper instance of `string` data type. If there is a stream that is encoded in UTF8, pass `Encoding.UTF8` to the `StreamReader` constructor, as opposed to passing nothing like in your first example or passing something else like in the second. — GSerg, Sep 22 '19 at 13:06
`Why is that?` - because one Unicode codepoint may be represented with [several `char`s](https://stackoverflow.com/q/14115503/11683). — GSerg, Sep 23 '19 at 08:28
It appears that for each Character such as ß or ä native messaging counts them as 2. Thus the message length that Native Messaging returns for fuß is 4 and for fußß is 6. Thus using the above code the Buffer object will have VBNullChar values in it. This is the problem. Currently using the Read method of the stream and reading each char seems to work. — darbid, Sep 23 '19 at 09:18
Then don't use that method. Use the [overload of `StreamReader.Read`](https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader.read?view=netframework-4.8#System_IO_StreamReader_Read_System_Char___System_Int32_System_Int32_) that already will only read up to the specified number of characters. — GSerg, Sep 23 '19 at 09:57
Yes that is what I am doing. But am hoping to understand the length value Native Messaging is sending. — darbid, Sep 23 '19 at 10:14
You are not doing that. You are overwriting your buffer over and over again, until there is no more to read in the stream. Remove your `While` loop and call `Read(Buffer, 0, Buffer.Length)` once. The length you are receiving is the number of `Char`s you must have in the buffer. — GSerg, Sep 23 '19 at 10:19

score 1 · Accepted Answer · answered Sep 29 '19 at 00:20

"Each message is serialized using JSON, UTF-8 encoded and is preceded with 32-bit message length in native byte order. The maximum size of a single message from the native messaging host is 1 MB." This implies that the message length is counted in bytes, and that the 4-byte length prefix is not part of the message (so the prefix itself is not included in the length).
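As an illustration of that layout, here is a minimal sketch that builds one frame by hand and dumps its bytes. `FrameMessage` is a hypothetical helper invented for this example, not part of any Chrome or .NET API, and the payload is plain text rather than real JSON to keep the dump short.

```vb
Imports System
Imports System.Text

Module FramingDemo
    ' Hypothetical helper (an assumption for this sketch): builds one frame as
    ' a 4-byte length prefix followed by the UTF-8 bytes of the content.
    Function FrameMessage(content As String) As Byte()
        Dim payload As Byte() = Encoding.UTF8.GetBytes(content)
        ' BitConverter uses the machine's native byte order.
        Dim prefix As Byte() = BitConverter.GetBytes(payload.Length)
        Dim frame As Byte() = New Byte(prefix.Length + payload.Length - 1) {}
        Buffer.BlockCopy(prefix, 0, frame, 0, prefix.Length)
        Buffer.BlockCopy(payload, 0, frame, prefix.Length, payload.Length)
        Return frame
    End Function

    Sub Main()
        ' ChrW(&HDF) is ß (U+00DF); written this way so the source file encoding does not matter.
        Dim frame As Byte() = FrameMessage("fu" & ChrW(&HDF))
        Console.WriteLine(BitConverter.ToString(frame))
        Console.WriteLine("Declared length: {0}", BitConverter.ToInt32(frame, 0))
        Console.WriteLine("Decoded Chars: {0}", Encoding.UTF8.GetString(frame, 4, frame.Length - 4).Length)
    End Sub
End Module
```

On a little-endian machine the first line printed is `04-00-00-00-66-75-C3-9F`: the prefix says 4 because ß takes two bytes (`C3 9F`) in UTF-8, while the decoded string has only 3 `Char`s.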

Your confusion seems to stem from one of two things:

- UTF-8 encodes a Unicode codepoint in 1 to 4 code units. (A UTF-8 code unit is 8 bits, one byte.)
- `Char` is a UTF-16 code unit. (A UTF-16 code unit is 16 bits, two bytes. UTF-16 encodes a Unicode codepoint in 1 to 2 code units.)
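To make the difference concrete, the following sketch compares the number of `Char`s in a string with the number of bytes it occupies once UTF-8 encoded, using the words from the question. `Utf8ByteCount` is a hypothetical wrapper added for this example; the underlying call is `Encoding.UTF8.GetByteCount`.

```vb
Imports System
Imports System.Text

Module Utf8LengthDemo
    ' Hypothetical helper (an assumption for this sketch): number of bytes
    ' the string occupies once it is UTF-8 encoded.
    Function Utf8ByteCount(content As String) As Integer
        Return Encoding.UTF8.GetByteCount(content)
    End Function

    Sub Main()
        ' ChrW(&HDF) is ß (U+00DF); written this way so the source file encoding does not matter.
        Dim eszett As String = ChrW(&HDF)
        Dim words As String() = {"fus", "fu" & eszett, "fu" & eszett & eszett}
        For Each word As String In words
            ' String.Length counts UTF-16 code units (Chars), not bytes.
            Console.WriteLine("{0} Chars, {1} UTF-8 bytes", word.Length, Utf8ByteCount(word))
        Next
    End Sub
End Module
```

This prints 3 and 3 for fus, 3 and 4 for fuß, and 4 and 6 for fußß, which matches the lengths reported in the question and comments.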

There is no way to tell how many codepoints or UTF-16 code units are in the message until after it is converted (or scanned, but then you might as well just convert it).

After a message has been read, presumably the stream will either be found to be closed, or the next thing to read will be another length and message.

So,

' Requires Imports System.IO and Imports System.Text at the top of the file.
Private Iterator Function Messages(stream As Stream) As IEnumerable(Of String)
    Using reader = New BinaryReader(stream)
        Try
            While True
                Dim length = reader.ReadInt32 ' 4-byte length prefix, counted in bytes
                Dim bytes = reader.ReadBytes(length) ' the UTF-8 encoded JSON payload
                Dim message = Encoding.UTF8.GetString(bytes) ' decode to a UTF-16 String
                Yield message
            End While
        Catch e As EndOfStreamException
            ' Expected when the sender is done
            Return
        End Try
    End Using
End Function

Usage

Messages(stream).ToList()

or

For Each message In Messages(stream)
    Debug.WriteLine(message)
Next message
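To try this without Chrome, a sketch along the following lines frames a couple of messages into a `MemoryStream` and reads them back. `WriteMessage` and `RoundTrip` are hypothetical helpers invented for the test, and the iterator is repeated with explicit types so that the module compiles on its own.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text

Module MessagesDemo
    ' Same approach as the iterator above: read a length, then that many bytes, then decode.
    Iterator Function Messages(source As Stream) As IEnumerable(Of String)
        Using reader As New BinaryReader(source)
            Try
                While True
                    Dim length As Integer = reader.ReadInt32()
                    Dim bytes As Byte() = reader.ReadBytes(length)
                    Yield Encoding.UTF8.GetString(bytes)
                End While
            Catch e As EndOfStreamException
                ' Expected when there is nothing left to read
                Return
            End Try
        End Using
    End Function

    ' Hypothetical helper (an assumption for this sketch): writes one length-prefixed UTF-8 message.
    Sub WriteMessage(writer As BinaryWriter, content As String)
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(content)
        writer.Write(bytes.Length) ' length prefix, counted in bytes
        writer.Write(bytes)
    End Sub

    ' Hypothetical helper: frames the given strings in memory and reads them back.
    Function RoundTrip(ParamArray contents As String()) As List(Of String)
        Dim buffer As New MemoryStream()
        Dim writer As New BinaryWriter(buffer)
        For Each content As String In contents
            WriteMessage(writer, content)
        Next
        writer.Flush()
        buffer.Position = 0 ' rewind so the reader starts at the first length prefix
        Return Messages(buffer).ToList()
    End Function

    Sub Main()
        Dim eszett As String = ChrW(&HDF) ' ß (U+00DF)
        For Each message As String In RoundTrip("{""text"":""fus""}", "{""text"":""fu" & eszett & """}")
            Console.WriteLine("{0} Chars received", message.Length)
        Next
    End Sub
End Module
```

Both messages come back with 14 `Char`s, even though the second one was sent with a length prefix of 15, because the prefix counts bytes and ß takes two of them.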

score 0 · Answer 2 · answered Sep 22 '19 at 15:19

If you're displaying the output of this code in a console, this would definitely happen, because the Windows console doesn't display Unicode characters. If this isn't the case, then try using a StringBuilder to convert the data inside your StdIn stream to a string.
