In my .NET 3.5 C# application I'm converting a Unicode-encoded byte array to a string.

The byte array is as follows:

{255, 254, 85, 0, 83, 0, 69, 0}

Using Encoding.Unicode.GetString(var), I convert the byte array to a string, which returns:

{65279 '', 85 'U', 83 'S', 69 'E'}

The leading character, 65279, seems to be a Zero Width No-Break Space, which is used as a Byte Order Mark in Unicode encoding, and its appearance is causing problems in the rest of my application.

Currently the workaround I'm using is var.Trim(new char[]{'\uFEFF','\u200B'});, which works just fine.

But the real question is: shouldn't GetString take care of removing the byte order mark? Or am I doing something wrong when converting the byte array?

Cristiano Sousa
  • @bzlm: _"Encoding.Unicode will likely return an UTF-16 encoder"_ -- no "likely" about it. It had _better_ do so, given that's what it's documented to do: ["An encoding for the UTF-16 format using the little endian byte order."](https://msdn.microsoft.com/en-us/library/system.text.encoding.unicode(v=vs.110).aspx) – Peter Duniho Mar 30 '15 at 19:43
  • Why are you trimming `\u200B`? – xanatos Mar 30 '15 at 19:51

1 Answer

No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all other characters in the byte[].

The only code that ought to be interpreting and filtering out the BOM would be code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.
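As a rough illustration of that point, here is a minimal sketch that feeds the byte array from the question through a StreamReader over a MemoryStream. The class and method names (`BomStreamDemo`, `ReadViaStreamReader`) are made up for this example; the third constructor argument is `detectEncodingFromByteOrderMarks`.

```csharp
using System.IO;
using System.Text;

static class BomStreamDemo
{
    // Hypothetical helper: decodes UTF-16 bytes through a StreamReader,
    // which treats a leading BOM as stream metadata rather than as text.
    public static string ReadViaStreamReader(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        // 'true' asks the reader to detect the encoding from the BOM.
        using (var reader = new StreamReader(stream, Encoding.Unicode, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With the array `{255, 254, 85, 0, 83, 0, 69, 0}` this should give back the three characters U, S, E without the leading U+FEFF.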

All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, in C# strings are stored internally as UTF16, so there's very little to that conversion when the original data is already in UTF16 :) ).
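If the bytes never pass through a reader, one option is to skip the preamble yourself before calling GetString(). The sketch below is only one way to do it, and `BomDecodeDemo` and `DecodeSkippingPreamble` are hypothetical names. It compares the start of the array against `Encoding.GetPreamble()` and decodes from the first byte after it.

```csharp
using System.Text;

static class BomDecodeDemo
{
    // Hypothetical helper: decodes 'data' with 'encoding', skipping the
    // encoding's preamble (BOM) if the array starts with it.
    public static string DecodeSkippingPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        int offset = 0;

        if (preamble.Length > 0 && data.Length >= preamble.Length)
        {
            bool match = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (data[i] != preamble[i]) { match = false; break; }
            }
            if (match) offset = preamble.Length;
        }

        // GetString(byte[], int, int) decodes only the bytes after the BOM.
        return encoding.GetString(data, offset, data.Length - offset);
    }
}
```

Compared with trimming the decoded string, this only ever removes a BOM at the very start of the data and leaves any later U+FEFF or U+200B characters alone.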

bzlm
Peter Duniho
  • But where does the BOM come from? Looking at the byte array I do not see its binary representation. – Cristiano Sousa Mar 30 '15 at 19:46
  • @CristianoSousa 255 254 is the "BOM", i.e. the space. Or did you mean something else? – bzlm Mar 30 '15 at 19:48
  • @CristianoSousa: as commenter bzlm says, it's in your original data. `255` == `0xff` and `254` == `0xfe`. So the first two bytes in this little-endian UTF16 encoding resolve to `0xfeff`, or `65279` decimal. Just as you see in the decoded text. – Peter Duniho Mar 30 '15 at 19:51
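
The arithmetic in that last comment can be checked with a few lines of C#. This is just a sketch, and `BomArithmeticDemo` and `CombineLittleEndian` are hypothetical names. In little-endian UTF-16 the first byte is the low half of the code unit and the second byte is the high half.

```csharp
static class BomArithmeticDemo
{
    // Hypothetical helper: combines two bytes stored in little-endian
    // order (low byte first) into a single 16-bit code unit.
    public static int CombineLittleEndian(byte low, byte high)
    {
        return low | (high << 8);
    }
}
```

`CombineLittleEndian(255, 254)` yields 0xFEFF, i.e. 65279, which is the first character seen in the decoded string.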