Consider this code:

using var mem = new MemoryStream();
await using var writer = new StreamWriter(mem, Encoding.UTF8);

await writer.WriteLineAsync("Test");
await writer.FlushAsync();
mem.Position = 0;

Then this code throws:

var x = Encoding.UTF8.GetString(mem.ToArray());
if (x[0] != 'T') throw new Exception("Bom is present in string");

Because the BOM is present. That doesn't make sense to me, since GetString should decode the bytes into a fully decoded string.

This code works as intended and does not include the BOM:

using var reader = new StreamReader(mem, Encoding.UTF8);
var x = await reader.ReadToEndAsync();
if (x[0] != 'T') throw new Exception("Bom is present in string");

Does anyone know Microsoft's reasoning behind this? To me it seems strange to keep a BOM in the result of a method called GetString.

Anders

1 Answer

It's important to remember that the Encoding class deals only with the encoding itself, not with streams, files, or packets. GetString converts the full or partial contents of a byte buffer into a Unicode string. It can be called on the entire buffer, or on just a part of it with GetString(byte[] bytes, int index, int count).

GetString neither generates nor strips BOM bytes. The bytes were emitted by StreamWriter because the encoding it was given explicitly specifies a preamble. The StreamWriter.Flush() source code shows that the method explicitly writes the output of Encoding.GetPreamble() to the stream:

if (preamble.Length > 0)
    stream.Write(preamble, 0, preamble.Length);
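
The split between preamble and content can be checked directly. The following is a minimal sketch, assuming a recent .NET SDK with C# top-level statements; it calls only `Encoding.UTF8.GetPreamble()`, `GetBytes`, and `GetString`, and the variable names are illustrative:

```csharp
using System;
using System.Text;

// Encoding.UTF8 reports a three-byte preamble.
byte[] preamble = Encoding.UTF8.GetPreamble();
Console.WriteLine(BitConverter.ToString(preamble)); // EF-BB-BF

// GetBytes encodes only the characters; it never prepends the preamble.
byte[] payload = Encoding.UTF8.GetBytes("Test");
Console.WriteLine(payload.Length); // 4

// Build the same byte layout that StreamWriter produced in the question.
byte[] withPreamble = new byte[preamble.Length + payload.Length];
preamble.CopyTo(withPreamble, 0);
payload.CopyTo(withPreamble, preamble.Length);

// GetString decodes every byte it is given. EF BB BF is the UTF-8
// encoding of U+FEFF, so it comes back as a single leading char.
string decoded = Encoding.UTF8.GetString(withPreamble);
Console.WriteLine((int)decoded[0]); // 65279
Console.WriteLine(decoded.Length);  // 5
```

This is consistent with the value 65279 reported for `x[0]` in the comments: it is the code point U+FEFF, which is what the three preamble bytes decode to.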

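GetBytes generates the bytes for the actual string contents. Its inverse, GetString, doesn't strip BOMs either: it decodes the preamble bytes like any other bytes, which is why U+FEFF (65279) shows up as the first character. Preambles are handled by the StreamReader class or by any custom code that reads raw bytes.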
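
If the raw bytes have to be decoded without a StreamReader, the preamble can be skipped by hand. This is only a sketch under stated assumptions: `DecodeSkippingPreamble` is a hypothetical helper, not a framework method, and it assumes the caller already knows which encoding produced the buffer.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper: decode a buffer, skipping the encoding's preamble
// when the buffer starts with it.
static string DecodeSkippingPreamble(byte[] bytes, Encoding encoding)
{
    byte[] preamble = encoding.GetPreamble();
    bool hasPreamble = preamble.Length > 0
        && bytes.Take(preamble.Length).SequenceEqual(preamble);
    int offset = hasPreamble ? preamble.Length : 0;
    // The index/count overload decodes only the bytes after the preamble.
    return encoding.GetString(bytes, offset, bytes.Length - offset);
}

// Reproduce the question's setup: a writer that emits the UTF-8 preamble.
using var mem = new MemoryStream();
using (var writer = new StreamWriter(mem, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
{
    writer.WriteLine("Test");
}

string text = DecodeSkippingPreamble(mem.ToArray(), Encoding.UTF8);
Console.WriteLine(text[0]); // T
```

Using a StreamReader, as in the question's second snippet, achieves the same result without custom code.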


From the Encoding.UTF8 property remarks:

The UTF8Encoding object that is returned by this property might not have the appropriate behavior for your app.

  • It returns a UTF8Encoding object that provides a Unicode byte order mark (BOM). To instantiate a UTF8 encoding that doesn't provide a BOM, call any overload of the UTF8Encoding constructor.
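
The difference described in these remarks can be seen by comparing preamble lengths. A minimal sketch, assuming C# top-level statements; the variable names are illustrative:

```csharp
using System;
using System.Text;

// Encoding.UTF8 is configured to provide a BOM.
Encoding withBom = Encoding.UTF8;

// Passing false to the constructor yields a UTF-8 encoding with no preamble.
Encoding withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

Console.WriteLine(withBom.GetPreamble().Length);    // 3
Console.WriteLine(withoutBom.GetPreamble().Length); // 0
```

Passing `withoutBom` to the StreamWriter in the question should therefore leave no preamble for GetString to decode.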

StreamWriter uses UTF-8 without a BOM when no encoding is specified, in both .NET Framework and .NET Core:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), so its GetPreamble method returns an empty byte array.
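
Applied to the question's scenario, omitting the encoding argument avoids the problem. A minimal sketch, assuming C# top-level statements and using synchronous calls for brevity:

```csharp
using System;
using System.IO;
using System.Text;

using var mem = new MemoryStream();
// No encoding argument: the writer uses UTF-8 without a preamble.
using var writer = new StreamWriter(mem);

writer.WriteLine("Test");
writer.Flush();

byte[] bytes = mem.ToArray();
string x = Encoding.UTF8.GetString(bytes);

Console.WriteLine(bytes[0] == (byte)'T'); // True
Console.WriteLine(x[0]);                  // T
```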

Panagiotis Kanavos
  • My question is why GetString returns a string with a BOM. A string is a decoded representation. There is no logical reason why a string would contain a BOM; it should be fully decoded by UTF8.GetString, and a string should not contain encoding details. – Anders Jul 07 '23 at 12:52
  • The first byte is *not* a BOM, it's `0xFF`. Something else is going on. `(x[0] != 'T')` doesn't check for a BOM, it checks that the first character isn't `T` – Panagiotis Kanavos Jul 07 '23 at 13:09
  • OK, maybe it's not a BOM, but it's a preamble that should be there. x[0] = 65279 – Anders Jul 07 '23 at 13:17
  • 65279 is BOM btw – Anders Jul 07 '23 at 13:19
  • The BOM for UTF8 is 3 bytes, not 65279. It's not a single character. That byte isn't part of a BOM, and nobody noticed the discrepancy for more than 20 years. Something else is going on. *Without* the MemoryStream, there's no problem – Panagiotis Kanavos Jul 07 '23 at 13:23
  • OK, there are Stack Overflow posts saying that 65279 is a common byte order mark. – Anders Jul 07 '23 at 13:25
  • You can try the code if you want; it will have the same output. We use the above code to serialize an object to XML in memory and save the string to SQL. That's how we noticed the problem. First we thought it was an encoding problem with the SQL column, but the problem happens when the string is created by GetString – Anders Jul 07 '23 at 13:27
  • That's wrong. From the [Wikipedia UTF8 BOM paragraph](https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8) *The UTF-8 representation of the BOM is the (hexadecimal) byte sequence EF BB BF.*. You'll have to translate those bytes as a 32-bit int in a *specific* order to get 65279 – Panagiotis Kanavos Jul 07 '23 at 13:28
  • @Anders I did, that's how I found out the byte is `0xFF`, not a BOM byte, and that only the *MemoryStream* causes this problem. Using `Encoding.GetBytes` and `GetString` doesn't cause the extra `0xFF`. This isn't a BOM issue, this is a stream issue, possibly reading past the end of the "stream". A `MemoryStream` isn't a real stream, it's a wrapper over a `byte[]`. – Panagiotis Kanavos Jul 07 '23 at 13:29
  • @Anders what we both missed is that `Encoding.GetString` isn't the reverse of `StreamWriter` but of `GetBytes`. Neither of these deals with BOM bytes, that's what the `Encoding.GetPreamble()` method is for. In the question's code, `GetString` is asked to deal with a value it wasn't meant to handle. `0xFF` is an error character – Panagiotis Kanavos Jul 07 '23 at 13:41
  • I tried using file streams, same outcome. So it's not memory streams. I just read your above comment. Pretty strange that GetString on UTF8 can't deal with a BOM. Makes GetString on UTF8 pretty useless. To me it would be more obvious if it decoded the content of the byte array into the correct string, knowing that the byte array contains UTF8 data – Anders Jul 07 '23 at 13:49
  • I explained what's going on at the top of the answer. `GetBytes/GetString` don't deal with preambles at all. Using them alone to handle file/stream content directly is wrong. `Makes GetString on UTF8 pretty useless` quite the opposite - they're used for over 20 years by dozens of millions of developers, for *every* possible kind of application, from the smallest to critical ones. Which means we, both, got it wrong – Panagiotis Kanavos Jul 07 '23 at 13:51
  • I still think it's a strange implementation, since I call GetString on the UTF8 instance. If it could at least throw an exception telling me what the problem is. Not clean code if you ask me. – Anders Jul 07 '23 at 13:53
  • No it's not. Because those methods are supposed to deal with *any* bytes, whether they were loaded from the middle, end or start of a large buffer or file. What if you were loading bytes from an HTTP response? A long-running one, like streaming JSON? Would you wait until you retrieved *all* the content? Or would you try to decode it batch by batch? – Panagiotis Kanavos Jul 07 '23 at 13:55
  • Why should the method that handles text bytes also handle the BOM? If it did, what method would handle the *next* set of bytes? How would you call *that* method? Remember, `Encoding` deals with the encoding itself, not the streams, readers or writers that use the encoding. It's not `File.ReadAllText` – Panagiotis Kanavos Jul 07 '23 at 13:56
  • Exactly, the encoding deals with encoding. And the BOM is part of the encoding. It's not intuitive that GetString on a UTF8 instance can't handle this. But I guess we can live with it. – Anders Jul 07 '23 at 14:00