192

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.

Due to incorrect encoding, a piece of my string looks like this in Spanish:

Acción

whereas it should look like this:

Acción

According to the answer to this question: How to know string encoding in C#, the string I am receiving should already be UTF-8, but it is being decoded with Encoding.Default (probably ANSI?).

I am trying to transform this string into real UTF-8, but one of the problems is that I can only see a subset of the Encoding class (the UTF8 and Unicode properties only), probably because I'm limited to the Windows Surface API.

I have tried some snippets I found on the internet, but none of them has worked so far for East Asian languages (e.g. Korean). One example is as follows:

var utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(myString);
myString = utf8.GetString(utfBytes, 0, utfBytes.Length);

I also tried extracting the string into a byte array and then using UTF8.GetString:

byte[] myByteArray = new byte[myString.Length];
for (int ix = 0; ix < myString.Length; ++ix)
{
    char ch = myString[ix];
    myByteArray[ix] = (byte) ch;
}

myString = Encoding.UTF8.GetString(myByteArray, 0, myString.Length);

Do you guys have any other ideas that I could try?

Community
Gaara
  • 5
    Your problem is coming from the code that created the string (from a stream or byte[]) in the first place. Please show that code. – SLaks Dec 27 '12 at 15:57
  • 1
    @Oded: .Net strings are stored in-memory as UTF16, but `Encoding.Default` returns the system's ANSI codepage. – SLaks Dec 27 '12 at 16:00
  • Here is an example of a string that doesn't work on English language: instead of displaying day's , my front end app is displaying: day’s – Gaara Dec 28 '12 at 17:17

7 Answers

323

Since you know the string was decoded as Encoding.Default, you could simply use:

byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);

One more thing to keep in mind: if you print strings with Console.WriteLine, you should also set Console.OutputEncoding = System.Text.Encoding.UTF8; first. Otherwise UTF-8 strings are written in the console's default code page (GBK on a Chinese-locale system, for example).
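
To see why re-encoding and then decoding repairs the text, it can help to reproduce the damage on purpose. The following is only a minimal sketch, assuming the wrong decoder was ISO-8859-1 (an encoding .NET can resolve by name); the real wrong encoding on your system may differ. `MojibakeDemo`, `Garble`, and `Repair` are hypothetical names used for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: the text was mis-decoded as ISO-8859-1 (Latin-1).
    private static readonly Encoding Wrong = Encoding.GetEncoding("iso-8859-1");

    // Simulates the bug: UTF-8 bytes decoded with the wrong 8-bit encoding.
    public static string Garble(string correct)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(correct);
        return Wrong.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }

    // Undoes the bug: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Wrong.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes, 0, originalBytes.Length);
    }
}
```

Under that assumption, `MojibakeDemo.Garble("Acción")` produces `Acción`, and `MojibakeDemo.Repair` turns it back into `Acción`. The repair only works when the wrong decoding step did not lose any bytes.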

ch271828n
anothershrubery
  • This works too it's actually much nicer than my answer which also works I am giving you a +1 nice work – MethodMan Dec 27 '12 at 16:36
  • Thanks! The problem is that, as I mentioned in the description, the API for surface is incomplete (no Encoding.Default available for me). – Gaara Dec 28 '12 at 16:06
  • 4
    @Gaara: Try `Encoding.GetEncoding(...)`; you will need to find the name of the actual encoding that was incorrectly used at the other end. – SLaks Dec 28 '12 at 16:49
  • We get crash reports from around the world that get inserted into a UTF8 database. I was getting an encoding error when inserting some crash reports from Europe. This transform allowed the reports to get inserted. Thanks a lot. – Adam Bruss Sep 27 '13 at 20:47
  • **TIP**: If you are using `Console.WriteLine` to output some strings, then you should also write *`Console.OutputEncoding = System.Text.Encoding.UTF8;`*!!! Or all utf8 strings will be outputed as gbk... – ch271828n Jul 05 '17 at 09:42
  • 1
    can you explain why this works? if Default is GB2312, then Encoding.Default.GetBytes will encode string to byte array use GB2312 encoder, then Encoding.UTF8.GetString will try to decode the byte array use UTF8 decoder, the result should be wrong, but why this works. @anothershrubery – guorongfei Feb 27 '18 at 01:41
  • 2
    @guorongfei The premise is that `myString` is mojibake. The code first undoes the wrong decoding then does the right decoding. It works as long as the wrong decoding hasn't lost data. But as @SLaks pointed out, it would be better to use the exact encoding that was wrong. (Better names and comments in the code would help in understanding how very wrong-looking code is actually an attempt at doing right.) – Tom Blodget Mar 01 '18 at 00:07
  • @TomBlodget Thank you very much for help. I got c# string field `myString` which contains chinese character, visual studio shows these character correctly, but i still have to do the aforementioned convert since i use some lib which only handle utf8 string. In this case, do you mean that `myString` is mojibake? why visual studio shows those character correctly. – guorongfei Mar 02 '18 at 01:52
  • @guorongfei Please post a separate question with a minimal version of your code and data. It seems like a different issue. If we discover that it's not, we'll get it linked back to this one. – Tom Blodget Mar 02 '18 at 12:49
  • Its turn out to be my own mistake. I passed a C# string to c++/cli and use marshal_as to convert it to std::string before i pass the result to some cpp lib which only accept utf8 encoded string. Thank you very much for help me to figure it out @TomBlodget – guorongfei Mar 06 '18 at 08:03
  • So many up-votes, so many bugs out there... A simple non-breaking space breaks the conversion if you have an `iso-8859-1` code page set. Consider `Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes("\xa0"));` vs `Encoding.Default.GetBytes(s)`. The latter will give you one byte, which is **incorrect**. – l33t Apr 28 '20 at 07:09
24
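// Mis-decoded input: UTF-8 bytes that were read with an 8-bit encoding.
string utf8String = "Acción";
string propEncodeString = string.Empty;

// Recover the original bytes by casting each char to a byte.
// This is only lossless when every char is at or below U+00FF.
byte[] utf8_Bytes = new byte[utf8String.Length];
for (int i = 0; i < utf8String.Length; ++i)
{
    utf8_Bytes[i] = (byte)utf8String[i];
}

// Decode the recovered bytes as UTF-8.
propEncodeString = Encoding.UTF8.GetString(utf8_Bytes, 0, utf8_Bytes.Length);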

Output should look like

Acción
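
The cast in the loop above treats each char as a single byte, which is only safe when every character is at or below U+00FF. A hedged sketch of the same idea with that check made explicit follows; `ByteCastRepair` and `TryRepairByCast` are hypothetical names, not part of the original code.

```csharp
using System;
using System.Text;

public static class ByteCastRepair
{
    // Returns false when a character does not fit in one byte, because
    // casting it to byte would silently drop its high bits.
    public static bool TryRepairByCast(string garbled, out string repaired)
    {
        byte[] bytes = new byte[garbled.Length];
        for (int i = 0; i < garbled.Length; ++i)
        {
            char ch = garbled[i];
            if (ch > 0xFF)
            {
                repaired = garbled; // leave the input untouched
                return false;
            }
            bytes[i] = (byte)ch;
        }
        repaired = Encoding.UTF8.GetString(bytes, 0, bytes.Length);
        return true;
    }
}
```

With `Acción` this returns true and yields `Acción`. With `day’s` it returns false, because `€` (U+20AC) and `™` (U+2122) are Windows-1252 characters that a plain cast cannot map back to their original bytes.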

Similarly, day’s is converted to day's by calling DecodeFromUtf8():

private static void DecodeFromUtf8()
{
    string utf8_String = "day’s";
    byte[] bytes = Encoding.Default.GetBytes(utf8_String);
    utf8_String = Encoding.UTF8.GetString(bytes);
}
hellow
MethodMan
  • 1
    Thanks! It does work in Spanish, the problem is that the same wouldn't work on eastern languages (i.e. korean). I'm trying to look for a 8-bit to UTF-8 conversion algorithm in the internet, but still no luck. – Gaara Dec 28 '12 at 16:06
  • Here is an example of a string that doesn't work on English language: instead of displaying day's , my front end app is displaying: day’s – Gaara Dec 28 '12 at 17:17
  • ok let me mess around with it and see what I can come up with – MethodMan Dec 28 '12 at 17:19
  • I tested and it returns day's I will paste the static method that I tested it's actually the same as what @anothershrubery has provided – MethodMan Dec 28 '12 at 17:25
  • you can alter that method by passing DecodeFromUtf8(string utf8string); – MethodMan Dec 28 '12 at 17:28
13

Your code is reading a sequence of UTF8-encoded bytes, and decoding them using an 8-bit encoding.

You need to fix that code to decode the bytes as UTF8.

Alternatively (not ideal), you could convert the bad string back to the original byte array—by encoding it using the incorrect encoding—then re-decode the bytes as UTF8.
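
If you control the point where the bytes become a string, decoding there avoids the repair step entirely. Here is a minimal sketch, assuming the third-party data reaches you as a raw UTF-8 byte array or stream (your bridge code may expose it differently); `Utf8Input`, `DecodeUtf8`, and `ReadAllUtf8` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Input
{
    // Decode a raw buffer that is known to contain UTF-8.
    public static string DecodeUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw, 0, raw.Length);
    }

    // Read an entire stream as UTF-8 text instead of relying on a default encoding.
    // Note: disposing the reader also closes the underlying stream.
    public static string ReadAllUtf8(Stream input)
    {
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```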

SLaks
  • Thanks! The problem is that the third party app is C++, while my code is C#, so I guess the decoding happens in the "bridge" between those two. – Gaara Dec 28 '12 at 16:07
10
 Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(mystring));
Riadh Hammouda
10

@anothershrubery's answer worked for me. I've made an enhancement using a StringExtensions class so I can easily convert any string in my program.

Method:

public static class StringExtensions
{
    public static string ToUTF8(this string text)
    {
        return Encoding.UTF8.GetString(Encoding.Default.GetBytes(text));
    }
}

Usage:

string myString = "Acción";
string strConverted = myString.ToUTF8();

Or simply:

string strConverted = "Acción".ToUTF8();
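
A possible variation, for platforms where Encoding.Default is not available or is not the encoding that was wrongly applied: pass the wrong encoding in explicitly and decode strictly, so that text which was already correct is returned unchanged. This is only a sketch; `StringRepairExtensions` and `ReinterpretAsUtf8` are hypothetical names, and the caller has to know which encoding to pass.

```csharp
using System;
using System.Text;

public static class StringRepairExtensions
{
    // Strict decoder: throws on invalid UTF-8 instead of inserting U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Re-encodes text with the encoding that was wrongly used to decode it,
    // then decodes the bytes as UTF-8. Returns the input unchanged when the
    // bytes are not valid UTF-8 (for example, the text was already correct).
    public static string ReinterpretAsUtf8(this string text, Encoding wrongEncoding)
    {
        byte[] bytes = wrongEncoding.GetBytes(text);
        try
        {
            return StrictUtf8.GetString(bytes, 0, bytes.Length);
        }
        catch (DecoderFallbackException)
        {
            return text;
        }
    }
}
```

For example, `"Acción".ReinterpretAsUtf8(Encoding.GetEncoding("iso-8859-1"))` gives `Acción`, while calling it on the already-correct `Acción` returns the string as it was.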
5

If you want to save any string to a MySQL database, do this:

1) The collation of your database field in phpMyAdmin [or any other control panel] should be set to utf8_general_ci.

2) Convert your string [e.g. textBox1.Text] to bytes. To do that:

2-1) Define byte[] st2;

2-2) Convert your string [textBox1.Text] to a UTF-8 [multibyte] byte array with:

byte[] st2 = System.Text.Encoding.UTF8.GetBytes(textBox1.Text);

3) Execute this SQL command before any query:

string mysql_query2 = "SET NAMES 'utf8'";
cmd.CommandText = mysql_query2;
cmd.ExecuteNonQuery();

3-2) Now insert this value into, for example, the name field with:

cmd.CommandText = "INSERT INTO customer (`name`) values (@name)";

4) The main point, which many solutions overlook, is the line below: use AddWithValue instead of Add for the command parameter:

// st2 is the byte array created in step 2-2
cmd.Parameters.AddWithValue("@name", st2);

Now you will see real data in your database server instead of ????

Tomas Kubes
3

Use the code snippet below to read a CSV file and get its contents as UTF-8 bytes:

protected byte[] GetCSVFileContent(string fileName)
{
    StringBuilder sb = new StringBuilder();
    // Read with the system default encoding; "true" lets a BOM override it.
    using (StreamReader sr = new StreamReader(fileName, Encoding.Default, true))
    {
        string line;
        // Read lines from the file until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            sb.AppendLine(line);
        }
    }
    string allines = sb.ToString();

    UTF8Encoding utf8 = new UTF8Encoding();

    // Not used below; new UTF8Encoding() has no BOM, so this array is empty.
    var preamble = utf8.GetPreamble();

    // Re-encode the text as UTF-8 bytes.
    var data = utf8.GetBytes(allines);

    return data;
}

Call it as shown below to send the result as an attachment:

Encoding csvEncoding = Encoding.UTF8;
//byte[] csvFile = GetCSVFileContent(FileUpload1.PostedFile.FileName);
byte[] csvFile = GetCSVFileContent("Your_CSV_File_NAme");

string attachment = String.Format("attachment; filename={0}.csv", "uomEncoded");

Response.Clear();
Response.ClearHeaders();
Response.ClearContent();
Response.ContentType = "text/csv";
Response.ContentEncoding = csvEncoding;
Response.AppendHeader("Content-Disposition", attachment);
//Response.BinaryWrite(csvEncoding.GetPreamble());
Response.BinaryWrite(csvFile);
Response.Flush();
Response.End();
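
One detail about the preamble may be worth illustrating: a UTF8Encoding created with the parameterless constructor has no byte order mark, so its GetPreamble() returns an empty array. If the program that opens the CSV needs a BOM to detect UTF-8 (an assumption about the consumer, not something stated above), a sketch like the following prepends one. `CsvBytes` and `WithUtf8Bom` are hypothetical names.

```csharp
using System;
using System.Text;

public static class CsvBytes
{
    // Encodes text as UTF-8 and prepends the 3-byte BOM (EF BB BF).
    public static byte[] WithUtf8Bom(string text)
    {
        var utf8WithBom = new UTF8Encoding(true); // true = preamble contains the BOM
        byte[] preamble = utf8WithBom.GetPreamble();
        byte[] data = utf8WithBom.GetBytes(text); // GetBytes itself never adds a BOM

        byte[] result = new byte[preamble.Length + data.Length];
        Buffer.BlockCopy(preamble, 0, result, 0, preamble.Length);
        Buffer.BlockCopy(data, 0, result, preamble.Length, data.Length);
        return result;
    }
}
```

The returned array could then be passed to Response.BinaryWrite in place of the plain UTF-8 bytes.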
jAntoni