i am writing text to a TextWriter. i want the UTF-16 Byte Order Mark (BOM) to appear in the output:

public void ProcessRequest(HttpContext context)
{
   context.Response.ContentEncoding = new UnicodeEncoding(true, true);
   WriteStuffToTextWriter(context.Response.Output);
}

Except the output doesn't contain a byte order mark:

HTTP/1.1 200 OK
Server: ASP.NET Development Server/10.0.0.0
Date: Thu, 06 Sep 2012 21:09:23 GMT
X-AspNet-Version: 4.0.30319
Content-Disposition: attachment; filename="Transactions_Calendar_20120906.csv"
Cache-Control: private
Content-Type: text/csv; filename="Transactions_Calendar_20120906.csv"; charset=utf-16BE
Content-Length: 95022
Connection: Close

JobName,ShiftName,6////09////2012 12::::00::::00 АΜ,...

How do i tell a TextWriter to write the encoding marker?

Note: The 2nd parameter of the UnicodeEncoding constructor:

   context.Response.ContentEncoding = new UnicodeEncoding(true, true);

byteOrderMark
Type: System.Boolean
true to specify that a Unicode byte order mark is provided; otherwise, false.

Ian Boyd
  • What exactly is `WriteStuffToTextWriter`? You probably have to specify the encoding there in your `StreamWriter`. – Stan R. Sep 06 '12 at 21:21
  • What makes you say that it doesn't contain a BOM with the code you have? – Jon Hanna Sep 06 '12 at 21:22
  • I'm with @JonHanna. Also, have you tried creating a console app and writing the same stuff directly to a file to see what that looks like? After all, a lot of *stuff* happens between your web server and your browser. – aquinas Sep 06 '12 at 21:35
  • A console app should hide the BOM too; the whole point of the BOM is that it doesn't appear as part of the text, but gives data on how to decode it from octets into text. A hex view of the stream above, though, would show an FE and FF or an FF and FE (the order of those bytes being precisely what the Byte Order Mark is meant to reveal, as U+FFFE isn't a valid character so only one order can be correct). Fiddler has a hex view. – Jon Hanna Sep 06 '12 at 21:38

2 Answers

Short Version

String zwnbsp = "\uFEFF"; //U+FEFF ZERO WIDTH NO-BREAK SPACE (\u takes exactly four hex digits, unlike the variable-length \x)

//The zero width no-break space character ***is*** the Byte-Order-Mark (BOM).
String s = zwnbsp+"The quick brown fox jumped over the lazy dog.";
writer.Write(s);

Long Version

At some point i realized how simple the solution is.

i used to think that the Unicode Byte-Order-Mark was some special signature. i used to think i had to carefully decide which byte sequence i wanted to output, in order to output the correct BOM:

  • 0xFE 0xFF
  • 0xFF 0xFE
  • 0xEF 0xBB 0xBF

But since then i realized that the Byte-Order-Mark is not some special byte sequence that you have to prepend to your file.

The BOM is just a Unicode character. You don't output any bytes; you only output the character U+FEFF. When you write that character, the writer's encoder converts it into the right bytes for whatever encoding you're using.

The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) was chosen for good reason. It's a space, so it carries no meaning, and it is zero width, so you shouldn't even see it.
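To see how the single character U+FEFF turns into each of the three byte sequences listed earlier, here is a minimal sketch. The class and method names (BomDemo, EncodeBom) are made up for illustration; the encodings are constructed with their own BOM option off so that only the encoded character is shown.

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Encodes the single character U+FEFF and returns the bytes as hex, e.g. "FE-FF".
    // Encoding.GetBytes never adds a preamble, so this is purely the encoded character.
    public static string EncodeBom(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\uFEFF"));
    }

    static void Main()
    {
        Console.WriteLine(EncodeBom(new UnicodeEncoding(true, false)));  // UTF-16 big-endian:    FE-FF
        Console.WriteLine(EncodeBom(new UnicodeEncoding(false, false))); // UTF-16 little-endian: FF-FE
        Console.WriteLine(EncodeBom(new UTF8Encoding(false)));           // UTF-8:                EF-BB-BF
    }
}
```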

That means that my question is fundamentally flawed. There is no such thing as "writing a byte-order-mark". You just make sure the first character you write out is U+FEFF. In my case i am writing to a TextWriter:

void WriteStuffToTextWriter(TextWriter writer)
{
   String csvExport = GetExportAsCSV();

   writer.Write("\uFEFF"); //Output Unicode character U+FEFF as a byte order mark
   writer.Write(csvExport);
}

The TextWriter will handle converting the Unicode character U+FEFF into whatever byte encoding it has been configured to use.
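One caveat, sketched here under the assumption that the TextWriter is a StreamWriter you created yourself over a fresh stream (unlike context.Response.Output, which per the output in the question emitted no BOM): a StreamWriter already writes the encoding's preamble when the encoding is configured with one, so an explicit U+FEFF would then appear twice. The names DoubleBomDemo and WriteWithExplicitBom are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class DoubleBomDemo
{
    // Writes an explicit U+FEFF followed by "A" through a StreamWriter
    // and returns the resulting bytes as hex.
    public static string WriteWithExplicitBom(Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write("\uFEFF");
                writer.Write("A");
            } // disposing the writer flushes it into the stream

            // ToArray still works after the MemoryStream has been closed.
            return BitConverter.ToString(stream.ToArray());
        }
    }

    static void Main()
    {
        // BOM option off: only the explicit U+FEFF is written.
        Console.WriteLine(WriteWithExplicitBom(new UnicodeEncoding(true, false))); // FE-FF-00-41
        // BOM option on: the StreamWriter's preamble plus the explicit U+FEFF.
        Console.WriteLine(WriteWithExplicitBom(new UnicodeEncoding(true, true)));  // FE-FF-FE-FF-00-41
    }
}
```

So it is worth checking a hex view of the output first, and writing U+FEFF by hand only when the writer does not emit a preamble itself.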

Note: Any code is released into the public domain. No attribution required.

Ian Boyd

Write out context.Response.ContentEncoding.GetPreamble(). Take a look at the question "Write text files without Byte Order Mark (BOM)?"
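A minimal sketch of that approach, using a MemoryStream as a stand-in for the response stream (an assumption made so the example runs outside ASP.NET; PreambleDemo and EncodeWithPreamble are hypothetical names). GetPreamble() returns an empty array when the encoding was constructed without a byte order mark, so the same code works either way.

```csharp
using System;
using System.IO;
using System.Text;

static class PreambleDemo
{
    // Writes the encoding's preamble (if any), then the encoded text, to a raw stream.
    public static byte[] EncodeWithPreamble(Encoding encoding, string text)
    {
        using (var stream = new MemoryStream())
        {
            byte[] preamble = encoding.GetPreamble(); // empty when no BOM is configured
            stream.Write(preamble, 0, preamble.Length);

            byte[] body = encoding.GetBytes(text);    // GetBytes itself adds no preamble
            stream.Write(body, 0, body.Length);

            return stream.ToArray();
        }
    }

    static void Main()
    {
        byte[] bytes = EncodeWithPreamble(new UnicodeEncoding(true, true), "A");
        Console.WriteLine(BitConverter.ToString(bytes)); // FE-FF-00-41
    }
}
```

In the handler from the question, the preamble bytes would presumably go to the response's binary output (for example via context.Response.BinaryWrite) before any text is written.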

dvallejo
  • Careful though. I'm not sure that they aren't actually outputting a BOM already. A second U+FEFF would be interpreted as a zero-width no-break space at the start of the actual text, after the BOM. – Jon Hanna Sep 06 '12 at 21:27