I am trying to modify some software written in C#, but I am not really a developer. The code reads data from a client and extracts values from it. The problem is that when values from the client contain non-English characters, they come out as gibberish. The code in question is:

public static string ReadNT(BinaryReader stream)
{
  string ret = "";
  byte addByte = 0x00;
  do {
    addByte = ReadByte(stream);
    if (addByte != 0x00)
      ret += (char)addByte;
  } while (addByte != 0x00);
  return ret;
}

As far as I can tell, it goes through the stream and converts each byte to a character, one by one, to build the string. The problem is that this doesn't work with Unicode/UTF-8. Is there a way to convert this into a string that works with UTF-8 values?

  • You should check out the UTF8Encoding class http://msdn.microsoft.com/en-us/library/system.text.utf8encoding(v=vs.110).aspx – Alex Wiese Nov 15 '12 at 00:37
  • From my (albeit limited) understanding of unicode, I think that you can't guarantee the size of each character. Therefore grabbing them one byte at a time like this will require a lot of workarounds. Your best bet is probably reading the entire stream in one go, and then decoding it. – Dan Nov 15 '12 at 01:05
  • @Dan For UTF8, you generally need to read it a byte at a time, as it is variable length. – Cole Tobin Nov 15 '12 at 01:12
  • "Not really a developer": I would STOP right there and not proceed with streams unless you learn some more – Cole Tobin Nov 15 '12 at 01:13

2 Answers

Try this:

public static string ReadNT(BinaryReader stream)
{
    List<byte> bytes = new List<byte>();
    byte addByte = 0x00;

    do
    {
        addByte = ReadByte(stream);

        if (addByte != 0x00)
        {
            bytes.Add(addByte); // keep the raw byte; casting to char would not compile for a List<byte>
        }
    } while (addByte != 0x00);

    return Encoding.UTF8.GetString(bytes.ToArray());
}

You can't convert the bytes to characters one at a time, because in UTF-8 a single character can be encoded as more than one byte. That is why I use a List<byte> to gather all the bytes up to the terminator and decode them in one call.
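
To see the difference on a concrete value, here is a small sketch comparing the per-byte cast from the question with decoding the whole buffer at once. The sample string "héllo" and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class CastVersusDecode
{
    // Mimics the original loop: every byte becomes its own char.
    public static string CastEachByte(byte[] bytes)
    {
        var sb = new StringBuilder();
        foreach (byte b in bytes)
            sb.Append((char)b);
        return sb.ToString();
    }

    // Decodes the whole buffer as UTF-8 in one call.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // "é" (U+00E9) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] wire = Encoding.UTF8.GetBytes("h\u00E9llo");

        Console.WriteLine(wire.Length);                       // 6 bytes for 5 characters
        Console.WriteLine(CastEachByte(wire).Length);         // 6 chars: the "é" was split in two
        Console.WriteLine(DecodeUtf8(wire) == "h\u00E9llo");  // True
    }
}
```

The cast turns 0xC3 and 0xA9 into two separate characters ("Ã" and "©"), which is the kind of gibberish described in the question.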

I think the big caveat here is that you will need to be sure that the client is sending you UTF-8 encoded text.
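
If you are unsure what the client sends, one option is to decode strictly so that bad input fails loudly instead of silently producing replacement characters. This is only a sketch: the class and method names are hypothetical, and it assumes that rejecting invalid input is acceptable for your application.

```csharp
using System;
using System.Text;

public static class StrictUtf8
{
    // UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true)
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true and the decoded text if the bytes are well-formed UTF-8.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        try
        {
            text = Strict.GetString(bytes);
            return true;
        }
        catch (ArgumentException)
        {
            // Invalid sequences surface as DecoderFallbackException,
            // which derives from ArgumentException.
            text = null;
            return false;
        }
    }
}
```

For example, the single byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own, so TryDecode rejects it.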

Edit:

Further to the comments on this answer, from "Can UTF-8 contain zero byte?":

Yes, the zero byte in UTF8 is code point 0, NUL. There is no other Unicode code point that will be encoded in UTF8 with a zero byte anywhere within it.

Therefore it is safe to assume that if you receive a zero byte, it is NUL and isn't actually part of a code point.
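
As a rough check rather than a proof, you can encode every code point other than NUL and look for a zero byte. The class and method names below are made up for the example.

```csharp
using System;
using System.Text;

public static class ZeroByteCheck
{
    // Counts code points other than U+0000 whose UTF-8 encoding contains 0x00.
    public static int CountEncodingsWithZeroByte()
    {
        int count = 0;
        for (int cp = 1; cp <= 0x10FFFF; cp++)
        {
            // Surrogates are not valid code points on their own; skip them.
            if (cp >= 0xD800 && cp <= 0xDFFF)
                continue;

            byte[] encoded = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(cp));
            if (Array.IndexOf(encoded, (byte)0) >= 0)
                count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountEncodingsWithZeroByte()); // prints 0
    }
}
```

The count is zero because every byte of a multi-byte UTF-8 sequence has its high bit set, and the single-byte range apart from NUL is 0x01 to 0x7F.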

nick_w
  • What if the UTF-8 character's last byte IS `0x00`? As in a two byte (utf8 encoded) character? The 1st bit of 0 states nothing follows and then you have 7 0's for the LAST 7 bits of the DECODED character. – Cole Tobin Nov 15 '12 at 01:12
  • @ColeJohnson that's a good question - I was concerned about that as well, but in a more general sense, e.g., a `0x00` appearing as the 2nd or 3rd byte in a four-byte character. According to http://en.wikipedia.org/wiki/UTF-8 (the description section), it looks like a multi-byte character won't contain a zero byte. – nick_w Nov 15 '12 at 01:41

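You could try using the StreamReader class to read the UTF-8 string. Note that StreamReader wraps the underlying Stream rather than the BinaryReader, and it buffers ahead, so this approach only suits a stream that contains nothing after the string.

public static string ReadNT(BinaryReader stream)
{
    // StreamReader has no ReadString(); ReadToEnd() decodes everything left
    // in the stream as UTF-8, so cut the result at the first NUL terminator.
    string text = new StreamReader(stream.BaseStream, Encoding.UTF8, false).ReadToEnd();
    int end = text.IndexOf('\0');
    return end < 0 ? text : text.Substring(0, end);
}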

If you have control over the protocol, you should consider transferring the size of the string in bytes in addition to the string itself.

public static string ReadNT(BinaryReader stream, int length)
{
    return Encoding.UTF8.GetString(stream.ReadBytes(length));
}
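
A minimal sketch of what both ends of such a protocol might look like. The framing here (a 4-byte Int32 length as written by BinaryWriter, followed by the UTF-8 bytes) is an assumption for illustration, not something the original client is known to send, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class LengthPrefixed
{
    // Writes a 4-byte length (in bytes, not characters) followed by the UTF-8 payload.
    public static void Write(BinaryWriter writer, string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        writer.Write(payload.Length);
        writer.Write(payload);
    }

    // Reads the length, then exactly that many bytes, and decodes them.
    public static string Read(BinaryReader reader)
    {
        int length = reader.ReadInt32();
        byte[] payload = reader.ReadBytes(length);
        if (payload.Length != length)
            throw new EndOfStreamException("String payload was truncated.");
        return Encoding.UTF8.GetString(payload);
    }

    // Writes a value to an in-memory stream and reads it back.
    public static string RoundTrip(string value)
    {
        using (var ms = new MemoryStream())
        {
            var writer = new BinaryWriter(ms);
            Write(writer, value);
            writer.Flush();
            ms.Position = 0;
            return Read(new BinaryReader(ms));
        }
    }
}
```

The length has to be the byte count of the encoded text, since a string with non-ASCII characters has more bytes than characters. Note that BinaryReader.ReadString expects a different prefix (a 7-bit encoded length), so it would not read this particular framing.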
Nathan Moinvaziri