Getting a unicode string from a raw TCP stream in C#

Question

So I am trying to make a modification to some software that is written in C#, but I am not really a developer. The code reads data from a client and gets values from it. The problem I am seeing is that when values from the client contain non-English characters, they come out as gibberish. The code in question is:

public static string ReadNT(BinaryReader stream)
{
  string ret = "";
  byte addByte = 0x00;
  do {
    addByte = ReadByte(stream);
    if (addByte != 0x00)
      ret += (char)addByte;
  } while (addByte != 0x00);
  return ret;
}

As far as I can tell, it goes through the stream and converts the bytes to characters one by one to build the string. The problem is that this doesn't work with Unicode/UTF-8. Is there a way to turn this into a string that works with UTF-8 values?

You should check out the UTF8Encoding class http://msdn.microsoft.com/en-us/library/system.text.utf8encoding(v=vs.110).aspx — Alex Wiese, Nov 15 '12 at 00:37
From my (albeit limited) understanding of unicode, I think that you can't guarantee the size of each character. Therefore grabbing them one byte at a time like this will require a lot of workarounds. Your best bet is probably reading the entire stream in one go, and then decoding it. — Dan, Nov 15 '12 at 01:05
@Dan For UTF8, you generally need to read it a byte at a time, as it is variable length. — Cole Tobin, Nov 15 '12 at 01:12
"Not really a developer": I would STOP right there and not proceed with streams unless you learn some more — Cole Tobin, Nov 15 '12 at 01:13

score 0 · Answer 1 · edited May 23 '17 at 12:18

Try this:

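public static string ReadNT(BinaryReader stream)
{
    // Requires: using System.Collections.Generic; using System.IO; using System.Text;
    // Collect the raw bytes first; decoding happens once, after the terminator.
    List<byte> bytes = new List<byte>();
    byte addByte = 0x00;

    do
    {
        // ReadByte is the helper already used in the question's code.
        addByte = ReadByte(stream);

        if (addByte != 0x00)
        {
            bytes.Add(addByte); // keep the raw byte; no cast to char
        }
    } while (addByte != 0x00);

    // Decode the whole byte sequence as UTF-8 in one step.
    return Encoding.UTF8.GetString(bytes.ToArray());
}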

You can't convert the bytes to characters one at a time, as some characters are encoded as more than one byte, hence my use of the List<byte> to gather all the bytes up to the terminator before decoding.
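
To see the difference concretely, here is a minimal sketch that decodes the same bytes both ways. The class and method names are made up for illustration: DecodePerByte mimics the cast-to-char loop from the question, while the other path decodes the whole buffer with Encoding.UTF8.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the original software.
public static class PerByteCastDemo
{
    // Mimics the question's loop: every byte is cast straight to a char.
    public static string DecodePerByte(byte[] data)
    {
        var sb = new StringBuilder();
        foreach (byte b in data)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] data = Encoding.UTF8.GetBytes("caf\u00e9");

        string wrong = DecodePerByte(data);           // c, a, f, U+00C3, U+00A9
        string right = Encoding.UTF8.GetString(data); // c, a, f, U+00E9

        Console.WriteLine(wrong.Length); // prints 5
        Console.WriteLine(right.Length); // prints 4
    }
}
```

The per-byte version produces one char per byte, so the two-byte character turns into two unrelated characters.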

I think the big caveat here is that you will need to be sure that the client is sending you UTF-8 encoded text.
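
If you can't be sure what the client sends, one option is to decode strictly so that invalid input fails loudly instead of silently turning into replacement characters. This is only a sketch: StrictUtf8Demo and TryDecode are names I made up, and it relies on the UTF8Encoding constructor's throwOnInvalidBytes flag. It cannot prove the client meant UTF-8 (plain ASCII, for instance, is valid in many encodings); it only catches byte sequences that are not valid UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the original software.
public static class StrictUtf8Demo
{
    // Second argument (throwOnInvalidBytes: true) makes GetString throw
    // on malformed input instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    public static bool TryDecode(byte[] data, out string text)
    {
        try
        {
            text = Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }

    public static void Main()
    {
        string text;

        // Valid UTF-8: "caf" followed by 0xC3 0xA9 (U+00E9).
        byte[] utf8 = { 0x63, 0x61, 0x66, 0xC3, 0xA9 };
        Console.WriteLine(TryDecode(utf8, out text));

        // Same word in Latin-1: a lone 0xE9 is a truncated UTF-8 sequence.
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };
        Console.WriteLine(TryDecode(latin1, out text));
    }
}
```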

Edit:

Further to the comments on this answer, from "Can UTF-8 contain zero byte?":

Yes, the zero byte in UTF8 is code point 0, NUL. There is no other Unicode code point that will be encoded in UTF8 with a zero byte anywhere within it.

Therefore it is safe to assume that if you receive a zero byte, it is NUL (the terminator here) and isn't part of a multi-byte character.
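
A quick way to convince yourself of this is to encode a few multi-byte characters and look for a zero byte. The sketch below uses a made-up helper, ContainsZeroByte, and a handful of sample characters I picked; it is a spot check, not a proof. The real reason is that every lead and continuation byte of a multi-byte UTF-8 sequence has its high bit set.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the original software.
public static class ZeroByteCheck
{
    // Returns true if the UTF-8 encoding of text contains a 0x00 byte.
    public static bool ContainsZeroByte(string text)
    {
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            if (b == 0x00)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // Two-, three- and four-byte characters. U+0100 is included because
        // its code point ends in 00, yet its UTF-8 form is 0xC4 0x80.
        string[] samples = { "\u00e9", "\u0100", "\u20ac", "\U0001F600" };

        foreach (string s in samples)
        {
            Console.WriteLine(ContainsZeroByte(s)); // False for each sample
        }

        // Only the NUL code point itself encodes to a zero byte.
        Console.WriteLine(ContainsZeroByte("\0")); // True
    }
}
```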

What if the UTF-8 character's last byte IS `0x00`? As in a two byte (utf8 encoded) character? The 1st bit of 0 states nothing follows and then you have 7 0's for the LAST 7 bits of the DECODED character. — Cole Tobin, Nov 15 '12 at 01:12
@ColeJohnson that's a good question - I was concerned about that as well, but in a more general sense, e.g., a `0x00` appearing as the 2nd or 3rd byte in a four-byte character. According to http://en.wikipedia.org/wiki/UTF-8 (the description section), it looks like a multi-byte character won't contain a zero byte. — nick_w, Nov 15 '12 at 01:41

Nathan Moinvaziri · Answer 2 · 2012-11-15T01:59:33.443

0

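You could try using the StreamReader class to read the UTF-8 string. Note that StreamReader wraps a Stream rather than a BinaryReader, so pass the reader's BaseStream, and that ReadToEnd reads until the stream ends rather than stopping at a zero byte.

public static string ReadNT(BinaryReader stream)
{
   // StreamReader has no ReadString method; ReadToEnd decodes everything left in the stream.
   return (new StreamReader(stream.BaseStream, Encoding.UTF8, false)).ReadToEnd();
}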

You should consider transferring the size of the string (in bytes) in addition to the string itself, if that is something you have control over.

public static string ReadNT(BinaryReader stream, int length)
{
    return Encoding.UTF8.GetString(stream.ReadBytes(length));
}
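
For completeness, here is a rough sketch of what both ends of a length-prefixed exchange could look like. The wire format (a 4-byte little-endian byte count followed by the UTF-8 payload) and the names LengthPrefixedDemo, WriteString and ReadString are my own assumptions, not something the existing client is known to do. A MemoryStream stands in for the network stream.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wire format: Int32 byte count, then that many UTF-8 bytes.
public static class LengthPrefixedDemo
{
    public static void WriteString(BinaryWriter writer, string text)
    {
        byte[] payload = Encoding.UTF8.GetBytes(text);
        writer.Write(payload.Length); // byte count, not character count
        writer.Write(payload);
    }

    public static string ReadString(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        byte[] payload = reader.ReadBytes(length);

        // ReadBytes returns fewer bytes if the stream ends early.
        if (payload.Length != length)
        {
            throw new EndOfStreamException("Stream ended before the full string arrived.");
        }
        return Encoding.UTF8.GetString(payload);
    }

    public static void Main()
    {
        var buffer = new MemoryStream();
        var writer = new BinaryWriter(buffer);
        WriteString(writer, "caf\u00e9");
        writer.Flush();

        buffer.Position = 0; // rewind so the reader starts at the prefix
        var reader = new BinaryReader(buffer);
        string roundTripped = ReadString(reader);

        Console.WriteLine(roundTripped == "caf\u00e9"); // prints True
        Console.WriteLine(buffer.Length);               // prints 9 (4-byte prefix + 5-byte payload)
    }
}
```

The prefix has to be the byte count because a string's Length counts UTF-16 chars, which differs from the UTF-8 byte count as soon as non-ASCII characters are involved.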

edited Nov 15 '12 at 01:59

answered Nov 15 '12 at 01:40


Not what the OP specifically asked, but I like it – Cole Tobin Nov 15 '12 at 02:36
