
I'm having a problem with UTF-8 encoding in my ASP.NET MVC 2 application in C#. I'm trying to let the user download a simple text file built from a string. I get the byte array with the following line:

var x = Encoding.UTF8.GetBytes(csvString);

but when I return it for download using:

return File(x, ..., ...);

I get a file without a BOM, so Croatian characters don't show up correctly. This is because my byte array does not include a BOM after encoding. I tried inserting those bytes manually and then it shows up correctly, but that's not the best way to do it.
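Roughly what I mean by inserting them manually (just a sketch; it assumes `System.Linq` and `System.Text` are imported and `csvString` holds the CSV content):

// Prepend the UTF-8 BOM bytes (EF BB BF) by hand before returning the file
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
var x = bom.Concat(Encoding.UTF8.GetBytes(csvString)).ToArray();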

I also tried creating a UTF8Encoding instance and passing true to its constructor so that it includes the BOM, but that doesn't work either.
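In other words, something like this (again just a sketch of what I tried):

var encoding = new UTF8Encoding(true); // true asks the encoding to emit the UTF-8 identifier (BOM)
var x = encoding.GetBytes(csvString);  // still no BOM bytes at the start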

Does anyone have a solution? Thanks!

– Nebojsa Veron

4 Answers


Try like this:

public ActionResult Download()
{
    var data = Encoding.UTF8.GetBytes("some data");
    var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
    return File(result, "application/csv", "foo.csv");
}

The reason is that the UTF8Encoding constructor that takes a boolean parameter doesn't do what you would expect:

byte[] bytes = new UTF8Encoding(true).GetBytes("a");

The resulting array would contain a single byte with the value of 97. There's no BOM because UTF8 doesn't require a BOM.
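To illustrate (a rough sketch): the constructor flag only affects what `GetPreamble()` returns; `GetBytes` never prepends it:

byte[] withBom = new UTF8Encoding(true).GetPreamble();  // EF BB BF
byte[] without = new UTF8Encoding(false).GetPreamble(); // empty array
byte[] data    = new UTF8Encoding(true).GetBytes("a");  // still just { 97 }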

– Darin Dimitrov
  • Thanks! I was going crazy with my special characters not working in Excel CSV :) – Hannes Sachsenhofer Jun 11 '13 at 14:23
  • 4
    For clarity, `Encoding.UTF8` is equivalent to `new UTF8Encoding(true)`. The parameter controls whether `GetPreamble()` will emit a BOM. – user247702 Sep 16 '14 at 08:45
  • 9
    There's no BOM because `GetBytes` can't assume we're writing to a file. Whoever writes to the file should do the preamble thing first (like a StreamWriter, for example). – Dave Van den Eynde Nov 26 '14 at 14:06
  • 2
    Why content type is set to "application/csv" instead of "text/csv" (as shown [here](http://www.freeformatter.com/mime-types-list.html))? In any case, neither way works, here. Excel still opens it with unrecognizable characters. – Veverke May 12 '15 at 16:39
  • The MIME type should be: `text/csv`, [see here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types) (and if you want to be more precise then use: `text/csv; charset=utf-8`, [see here](https://www.w3.org/International/articles/http-charset/index#charset)). – Ofir Mar 13 '18 at 10:00
  • 1
    If I use contentType of `application/csv` it works fine, but if I replace it with `text/csv` it stops working, maybe someone has a clue why is that? – Ramūnas Mar 20 '19 at 15:26
  • I was having this issue as well, and this is the only solution that worked for me. There are other suggestions about telling the user to change the encoding, but that doesn't work when you have thousands of users complaining of random encoding issues; customers never read the instructions. It's better to provide the file in a format that works as expected. – Goca Oct 22 '20 at 20:12

I created a simple extension method that converts any string, in any encoding, to the byte array it would produce when written to a file or stream:

public static class StreamExtensions
{
    public static byte[] ToBytes(this string value, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        using (var sw = new StreamWriter(stream, encoding))
        {
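            // StreamWriter writes the encoding's preamble (the BOM, if the encoding has one) at the start of the stream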
            sw.Write(value);
            sw.Flush();
            return stream.ToArray();
        }
    }
}

Usage:

stringValue.ToBytes(Encoding.UTF8)

This also works for other encodings, such as UTF-16, which does require a BOM.
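For example (a sketch; `Encoding.Unicode` is .NET's little-endian UTF-16):

stringValue.ToBytes(Encoding.Unicode) // resulting bytes start with the FF FE BOM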

– Hovhannes Hakobyan
    This is actually a very useful workaround. The use of a `StreamWriter`, with encoding, solved my immediate problem and allowed my file to be opened with Excel 2013. – iCollect.it Ltd Jun 29 '15 at 10:00
  • Thanks. It helped me save a .csv with Arabic characters. Using Encoding.GetBytes returned a bad file, with unknown characters. – Markomar Jun 25 '20 at 09:48

UTF-8 does not require a BOM, because it is a sequence of 1-byte words. UTF-8 = UTF-8BE = UTF-8LE.

In contrast, UTF-16 requires a BOM at the beginning of the stream to identify whether the remainder of the stream is UTF-16BE or UTF-16LE, because UTF-16 is a sequence of 2-byte words and the BOM identifies whether the bytes in the words are BE or LE.

The problem does not lie with the Encoding.UTF8 class. The problem lies with whatever program you are using to view the files.
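To make the contrast concrete, here is a small sketch of the preambles .NET reports for each encoding:

byte[] utf8Bom    = Encoding.UTF8.GetPreamble();             // EF BB BF (optional signature)
byte[] utf16LeBom = Encoding.Unicode.GetPreamble();          // FF FE (little-endian)
byte[] utf16BeBom = Encoding.BigEndianUnicode.GetPreamble(); // FE FF (big-endian)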

– yfeldblum
    UTF-8 is a variable width encoding. It only requires 1 byte to encode ASCII characters, but other code points will use multiple bytes. – Joel Fillmore Jul 12 '11 at 15:47
  • 2
    The codepoints encoded with multiple bytes have a pre-defined order (based on the `U+` big-endian representation). However, since UTF8 is represented as a stream of bytes (rather than as a stream of words or dwords which are themselves represented as a sequence of bytes), the concept of endianness doesn't apply. Endianness is applicable to the representation of 16-, 32-, 64-, 128-bit integers as bytes, not to the representation of codepoints as bytes. – yfeldblum Jul 12 '11 at 16:59
  • Sorry, I thought you were referring to the storage of codepoints with the phrase "sequence of 1 byte words". Thanks for the clarification. +1 for your answer and comment. – Joel Fillmore Jul 12 '11 at 19:16
  • 1
    Some programs use it to detect the encoding as being UTF-8. Programs that don't require it should ignore it as the character emitted is something that is to be ignored anyway. It's older programs that can't handle the BOM. – Dave Van den Eynde Nov 26 '14 at 14:04
  • It does, if you wanna, say, open a UTF-8 file that has surrogate pairs in Visual Studio... – marc hoffman Oct 13 '17 at 12:56
  • @yfeldblum Sorry, but although I agree with you regarding the lack of encoding recognition in some programs, when the faulty program is something as widespread as Excel 2016 opening CSV files, answers like the ones of Hovhannes Hakobyan or Darin Dimitrov are much more helpful than yours. – AFract Oct 03 '18 at 08:09

Remember that .NET strings are all Unicode while they are in memory, so if you can see your csvString correctly in the debugger, the problem lies in how the file is written.

In my opinion you should return a FileResult with the same encoding as the file's. Try setting the encoding on the returned File.
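Something along these lines (just a sketch of the idea; `csvString` is assumed to hold the CSV text, and the charset is declared in the content type so the browser knows the bytes are UTF-8):

public ActionResult Download()
{
    var bytes = Encoding.UTF8.GetBytes(csvString); // csvString: your CSV content
    return File(bytes, "text/csv; charset=utf-8", "foo.csv");
}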

– Daniel Peñalba