C# non english characters like ü, ç ,é present in SQL Server database is getting translated to Ã¼, Ã§, Ã© in CSV

Question

I am working on a report to move data from table to reports in CSV.

Below is my SQL Server database that has column values as below, in non-english characters such as ü, ç ,é. And i have provided CSV output generated. The code which I am using also I have pasted below. 1st image is database data. 2nd image is Excel report data.

Database screeenshot of non-english character

Below is the CSV output

I have below code written.

public static byte[] GetMemoryStream<T>(List<T> data) where T : class
{
    Logger.Debug("Inside GetMemoryStream(), step 7");
    using (var mem = new MemoryStream())
    using (var writer = new StreamWriter(mem))
    using (var csvWriter = new CsvWriter(writer, System.Threading.Thread.CurrentThread.CurrentCulture)) // System.Threading.Thread.CurrentThread.CurrentCulture
    {                
        var options = new TypeConverterOptions { Formats = new[] { "yyyy/MM/dd HH:mm:ss" } };
        csvWriter.Configuration.TypeConverterOptionsCache.AddOptions<DateTime>(options);
        csvWriter.Configuration.Delimiter = ",";

        csvWriter.WriteHeader<T>();
        csvWriter.NextRecord();
        csvWriter.WriteRecords(data);

        writer.Flush();
        var result = Encoding.Unicode.GetString(mem.ToArray());
        //Console.WriteLine(result);
        return Encoding.Unicode.GetBytes(result);
        // return mem.ToArray();
    }

You need to tell us the datatypes of the columns, and the collation of your database (or the columns if individually collated). — Dale K, Aug 18 '20 at 06:18
Your code seems to maybe assume that `StreamWriter` will be writing using `Encoding.Unicode`. It uses `Encoding.UTF8` by default. The bigger question though is: why when `mem.ToArray()` returns an array of bytes, are you converting it to a string (using the wrong encoding) and back to bytes? Are you trying to convert the encoding system used? Why not just pass `Encoding.Unicode` to the `StreamWriter` constructor? — ProgrammingLlama, Aug 18 '20 at 06:18

score 2 · Accepted Answer · answered Aug 18 '20 at 06:36

I'm not really sure what you're trying to do, but I think I can write an answer that covers all the bases.

First I'll explain what your current code is doing:

using (var mem = new MemoryStream())
using (var writer = new StreamWriter(mem))

var result = Encoding.Unicode.GetString(mem.ToArray());
return Encoding.Unicode.GetBytes(result);

The StreamWriter constructor you're using will set the writer up to encode content with UTF8 encoding:

Initializes a new instance of the StreamWriter class for the specified stream by using UTF-8 encoding and the default buffer size.

mem.ToArray() will return the contents of the memory stream as a byte array (byte[]).

You then take that byte array and try to convert it to a string using Unicode encoding. See What is the difference between UTF-8 and Unicode? Clearly, being different encoding systems, you can't expect the resulting string to be true to the original data you wrote.

Finally, you convert that string into a Unicode-encoded byte[]. Clearly the data here will still be wrong because we started with a bad copy.

Looking at your code I'm assuming one of two things has happened:

You have mistakenly assumed that Unicode == UTF8 (it doesn't).
You're unaware that you can control the encoding used by StreamWriter and you're trying to convert the UTF8-encoded data it generates into a Unicode-encoded file.

Assuming you want UTF8-encoded text, the fix for 1 would be to simply return mem.ToArray() without the conversion via result:

using (var mem = new MemoryStream())
using (var writer = new StreamWriter(mem))

return mem.ToArray();

The fix for 2 would be to simply initialize StreamWriter with the correct encoding (assuming you want Unicode) and then return mem.ToArray():

using (var mem = new MemoryStream())
using (var writer = new StreamWriter(mem, Encoding.Unicode))

return mem.ToArray();

If you want to do a conversion from UTF8 to Unicode, you need to decode the string using UTF8 and then encode it using Unicode:

using (var mem = new MemoryStream())
using (var writer = new StreamWriter(mem, Encoding.UTF8)) // UTF8 is default, but for clarity I've included it explicitly here.

var result = Encoding.UTF8.GetString(mem.ToArray());
return Encoding.Unicode.GetBytes(result);

This will decode the string from the byte[] using UTF8 encoding, and then re-encode it to a byte[] using Unicode encoding.

Below 3 lines of code as suggested by you have halepd me resolve my issue using (var writer = new StreamWriter(mem, Encoding.UTF8)) var result = Encoding.Unicode.GetString(mem.ToArray()); return Encoding.Unicode.GetBytes(result); — Abhijeet Sinha, Aug 21 '20 at 10:47

score -1 · Answer 2 · edited Aug 21 '20 at 11:25

Below is the piece of code that worked for me

(var writer = new StreamWriter(mem, Encoding.UTF8)) 
var result = Encoding.Unicode.GetString(mem.ToArray()); 
return Encoding.Unicode.GetBytes(result);

So now i am not getting any ?'s in excel sheet or other unknown characters which i was getting previously, below is comparison image of my previous code versus my current code. and current code is great

C# non english characters like ü, ç ,é present in SQL Server database is getting translated to Ã¼, Ã§, Ã© in CSV

2 Answers2