117

I have googled on this topic and I have looked at every answer, but I still don't get it.

Basically, I need to convert a UTF-8 string to ISO-8859-1, and I do it using the following code:

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(Message));

My source string is

Message = "ÄäÖöÕõÜü"

But unfortunately my result string becomes

msg = "�ä�ö�õ�ü"

What am I doing wrong here?

Tieme
Daniil Harik
  • 5
    All strings in .NET internally store the strings using unicode characters. There is no notion of a String being "windows-1252", "iso-8859-1", "utf-8", etc. Are you trying to throw away any characters in your string that do not have a representation in the Windows-1252 code page? – Ian Boyd Dec 17 '09 at 14:58
  • 1
    @IanBoyd Actually, a [String](https://msdn.microsoft.com/en-us/library/system.string(v=vs.110).aspx) is a counted sequence of UTF-16 code units. (Unfortunately, the term Unicode has been misapplied in `Encoding.Unicode` and in the Win32 API. Unicode is a character set, not an encoding. UTF-16 is one of several encodings for Unicode.) – Tom Blodget Nov 19 '16 at 15:36
  • 1
    Your code does the wrong thing: it encodes the string to a byte array as UTF-8 but then decodes those bytes as ISO-8859-1. If you want a string round-tripped through ISO-8859-1, simply call **string msg = iso.GetString(iso.GetBytes(Message));** – StuS Sep 06 '17 at 15:12
  • That's called Mojibake. – Rick James Jul 13 '18 at 18:23
  • I guess what Daniil is saying is that `Message` was decoded from UTF-8. Assuming that part worked correctly, converting to Latin-1 is as simple as `byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(Message)`. Then, like StuS says, you can convert the Latin-1 bytes back to UTF-16 with `Encoding.GetEncoding("ISO-8859-1").GetString(bytes)` – Qwertie Oct 30 '19 at 15:01

9 Answers

203

Use Encoding.Convert to adjust the byte array before attempting to decode it into your destination encoding.

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
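
The comments on this answer disagree about what happens to characters that ISO-8859-1 cannot represent. The following is a minimal sketch, not part of the original answer, that makes the behaviour explicit by requesting the encoding with `EncoderFallback.ExceptionFallback`, so an unrepresentable character raises an exception instead of being substituted. The class name `FallbackDemo` and the sample strings are assumptions chosen for illustration.

```csharp
using System;
using System.Text;

class FallbackDemo
{
    static void Main()
    {
        // Strict ISO-8859-1: throw instead of silently substituting characters.
        Encoding strictIso = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        string representable = "\u00C4\u00E4\u00D6\u00F6"; // "ÄäÖö", all within ISO-8859-1
        string notRepresentable = "\u017B";                 // "Ż", outside ISO-8859-1

        // For representable text, encode + decode with the same encoding is lossless.
        byte[] isoBytes = strictIso.GetBytes(representable);
        Console.WriteLine(strictIso.GetString(isoBytes) == representable); // True

        try
        {
            strictIso.GetBytes(notRepresentable);
        }
        catch (EncoderFallbackException ex)
        {
            // CharUnknown is the character that could not be encoded.
            Console.WriteLine("Cannot encode U+" + ((int)ex.CharUnknown).ToString("X4")); // Cannot encode U+017B
        }
    }
}
```

If substitution is preferred over an exception, `new EncoderReplacementFallback("?")` can be passed instead, which makes the replacement character explicit rather than dependent on the runtime's default.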
AaronLS
Nathan Baulch
  • 7
    The one liner is `Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding("ISO-8859-1"), Encoding.UTF8.GetBytes(myString)))` –  Dec 11 '15 at 15:35
  • 1
    If you are creating the string yourself inside C#/.Net, then this code is not 100% correct, you need to encode from UTF-16 (which is the variable "Unicode"). Because this is the default. So UTF8 in the code above has to be changed to Unicode. – goamn Jun 01 '17 at 01:42
  • 1
    I recommend using this: Encoding iso = Encoding.GetEncoding("ISO-8859-9"); because the Turkish encoding covers almost all alphabets extended from Latin. – Fuat Aug 31 '18 at 10:55
  • 3
    You know, `isoBytes` is also just `iso.GetBytes(Message);`. There is no need to convert anything here. In fact, you can just skip all that and say `string msg = Message`. There's no real point to _any_ of these conversions, since the start and end are both just a .Net `String`. And text encodings are irrelevant on a .Net `String` as long as you don't need to handle it as bytes. – Nyerguds Nov 02 '20 at 08:51
  • This answer is misleading to users unfamiliar with encoding. The code above, taken as a whole, does exactly nothing to the string, because Encoding.Convert reverses the change in encoding. Encoding.Convert is defined as dstEncoding.GetBytes(srcEncoding.GetChars(bytes)). Plugged into your code we obtain dstEncoding.GetChars(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(Message)))). All operations here reverse themselves. This is equivalent to Message. – Prophet Lamb Feb 18 '23 at 17:33
  • @Nyerguds and @Prophet Lamb = I think your comments are basically making the same point and they're correct (I think) based on the original question's input string specifically, but not for any string. This answer is exactly what I needed as it replaces UTF-8 characters that aren't present in ISO-8859-1 with fallbacks (e.g. `Ż` becomes `Z` with this code; I'm not sure what it does if there is no fallback), so it's not equivalent to `string msg = Message` as you both stated. – jsabrooke Jul 19 '23 at 21:43
31

I think your problem is that you assume the bytes that represent the UTF-8 string will produce the same string when interpreted as something else (ISO-8859-1). That is simply not the case. I recommend that you read this excellent article by Joel Spolsky.
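
To make that point concrete, here is a small sketch (the class name `MojibakeDemo` and the two-character sample are assumptions for illustration) that prints the bytes each encoding produces for the same text and shows what happens when UTF-8 bytes are decoded as ISO-8859-1.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        string message = "\u00C4\u00E4"; // "Ää"

        // The same two characters become different byte sequences per encoding.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(message);
        byte[] isoBytes = iso.GetBytes(message);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // C3-84-C3-A4
        Console.WriteLine(BitConverter.ToString(isoBytes));  // C4-E4

        // ISO-8859-1 maps every byte to exactly one character, so decoding the
        // four UTF-8 bytes with it yields four characters instead of two.
        string wrong = iso.GetString(utf8Bytes);
        Console.WriteLine(wrong.Length); // 4

        // Decoding with the encoding that produced the bytes restores the text.
        string right = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine(right == message); // True
    }
}
```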

Klaus Byskov Pedersen
  • 2
    Excellent article indeed and with a sense of humor! I was facing an encoding issue today at work and this helped me out. – Pantelis Aug 23 '12 at 13:18
16

Try this:

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8,iso,utfBytes);
string msg = iso.GetString(isoBytes);
Manu
  • Why am I getting the same UTF-8 message? In place of Message I passed string message=sdjfhsjdf, and I get the same output in the msg variable. How do I get the Latin data? – user1237131 Jan 09 '13 at 05:43
  • This works for me. Remember to include System.Text namespace. – Spawnrider Jun 03 '13 at 14:29
  • 2
    Encoding.Convert throws a fallback exception while converting if the string has non-ISO characters – Tertium May 15 '14 at 14:53
  • This answer is misleading to users unfamiliar with encoding. The code above, taken as a whole, does exactly nothing to the string, because Encoding.Convert reverses the change in encoding. Encoding.Convert is defined as `dstEncoding.GetBytes(srcEncoding.GetChars(bytes))`. Plugged into your code we obtain `dstEncoding.GetChars(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(Message))))`. All operations here reverse themselves. This is equivalent to `Message`. – Prophet Lamb Feb 18 '23 at 17:32
9

You need to fix the source of the string in the first place.

A string in .NET is actually just an array of 16-bit Unicode code units (UTF-16), so at the string level there is no notion of it being UTF-8 or ISO-8859-1.

It's when you take that string and convert it to a set of bytes that encoding comes into play.

In any case, what you did, encoding a string to a byte array with one character set and then decoding it with another, will not work, as you have seen.

Can you tell us more about where that original string comes from, and why you think it has been encoded wrong?
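
As a rough illustration of the point that encoding only comes into play when a string is turned into bytes, the sketch below (class name `StringVersusBytes` is an assumption; the sample text is the string from the question, written with escapes) encodes one string three ways and confirms that each encoding round-trips it when the same encoding is used in both directions.

```csharp
using System;
using System.Text;

class StringVersusBytes
{
    static void Main()
    {
        // "ÄäÖöÕõÜü" from the question, written with escapes so the result
        // does not depend on how this source file itself is saved.
        string text = "\u00C4\u00E4\u00D6\u00F6\u00D5\u00F5\u00DC\u00FC";
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");

        Console.WriteLine(text.Length);                         // 8 characters
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));    // 16 bytes
        Console.WriteLine(iso.GetByteCount(text));              // 8 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(text)); // 16 bytes (UTF-16)

        // Encoding and decoding with the SAME encoding gives back the string.
        Console.WriteLine(Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text)) == text); // True
        Console.WriteLine(iso.GetString(iso.GetBytes(text)) == text);                     // True
    }
}
```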

Lasse V. Karlsen
  • It's coming directly from App.config and I was thinking it's UTF8 by default. Thank You! – Daniil Harik Dec 17 '09 at 14:47
  • The encoding of that file might impact how the file gets interpreted, so I would look at that. – Lasse V. Karlsen Dec 17 '09 at 16:23
  • 2
    Correct me if I'm wrong, but my understanding is that, while technically it "isn't in any particular encoding", a .NET string is a byte array that corresponds precisely to a UTF-16 file, byte for byte (excluding the BOM). It even uses surrogates in the same way (which seems like an encoding trick). Of course, you generally want to store files as UTF-8 but process the data in memory as 16-bit. (Or 32-bit, to avoid the complexity of surrogate pairs, though I'm not sure if that's really feasible.) – Jon Coombs Sep 27 '13 at 00:58
  • @JonCoombs I don't think that's correct. UTF-16 works with expanding opcodes. The .Net strings just use an array of 16-bit code points, without any expansion. As far as I know it only supports the 0000-FFFF range. – Nyerguds Nov 02 '20 at 08:56
8

The code seems a bit strange. To get a string from a UTF-8 byte stream, all you need to do is:

string str = Encoding.UTF8.GetString(utf8ByteArray);

If you then need to save an ISO-8859-1 byte stream somewhere, just add one more line to the previous code:

byte[] iso88591data = Encoding.GetEncoding("ISO-8859-1").GetBytes(str);
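
The two lines above can be combined into a small file-to-file conversion. This is only a sketch under stated assumptions: the helper name `Utf8FileToLatin1File` and the use of temporary files are hypothetical and not part of the original answer, and characters outside ISO-8859-1 are handled by whatever fallback the runtime applies by default.

```csharp
using System;
using System.IO;
using System.Text;

class TranscodeDemo
{
    // Hypothetical helper: decode a UTF-8 file into a string, then
    // write that string back out encoded as ISO-8859-1.
    static void Utf8FileToLatin1File(string sourcePath, string targetPath)
    {
        string text = File.ReadAllText(sourcePath, Encoding.UTF8);
        File.WriteAllText(targetPath, text, Encoding.GetEncoding("ISO-8859-1"));
    }

    static void Main()
    {
        string source = Path.GetTempFileName();
        string target = Path.GetTempFileName();
        try
        {
            // Prepare a UTF-8 input file containing "ÄäÖö".
            File.WriteAllBytes(source, Encoding.UTF8.GetBytes("\u00C4\u00E4\u00D6\u00F6"));

            Utf8FileToLatin1File(source, target);

            // One byte per character in the ISO-8859-1 output.
            Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(target))); // C4-E4-D6-F6
        }
        finally
        {
            File.Delete(source);
            File.Delete(target);
        }
    }
}
```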
Sander A
  • 1
    This is clearly the most straightforward answer. The problem in the code is indeed that the author seems to assume that a String in C# can already be stored "using" a certain encoding, which simply isn't true; they're always UTF16 internally. – Nyerguds Mar 14 '16 at 12:33
  • 1
    Fully agree. When you already have UTF-16, it is quite hard to make that into correct encoding, because when you converted byte array to string with wrong encoding there is already loss of information. – Sander A Mar 18 '16 at 14:01
1

Maybe this can help. It converts text from one code page to another:

    public static string fnStringConverterCodepage(string sText, string sCodepageIn = "ISO-8859-8", string sCodepageOut="ISO-8859-8")
    {
        string sResultado = string.Empty;
        try
        {
            byte[] tempBytes;
            tempBytes = System.Text.Encoding.GetEncoding(sCodepageIn).GetBytes(sText);
            sResultado = System.Text.Encoding.GetEncoding(sCodepageOut).GetString(tempBytes);
        }
        catch (Exception)
        {
            sResultado = "";
        }
        return sResultado;
    }

Usage:

string sMsg = "ERRO: Não foi possivel acessar o servico de Autenticação";
var sOut = fnStringConverterCodepage(sMsg, "ISO-8859-1", "UTF-8");

Output:

"Não foi possivel acessar o servico de Autenticação"
nandox
0
Encoding targetEncoding = Encoding.GetEncoding(1252);
// Encode a string into an array of bytes.
Byte[] encodedBytes = targetEncoding.GetBytes(utfString);
// Show the encoded byte values.
Console.WriteLine("Encoded bytes: " + BitConverter.ToString(encodedBytes));
// Decode the byte array back to a string, using the same encoding that produced the bytes.
String decodedString = targetEncoding.GetString(encodedBytes);
Tomáš Opis
-1

I just used Nathan's solution and it works fine. I needed to convert ISO-8859-1 to Unicode:

string isocontent = Encoding.GetEncoding("ISO-8859-1").GetString(fileContent, 0, fileContent.Length);
byte[] isobytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(isocontent);
byte[] ubytes = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, isobytes);
return Encoding.Unicode.GetString(ubytes, 0, ubytes.Length);
PiotrWolkowski
Nicolai Nita
-5

Here is a sample for ISO-8859-9:

protected void btnKaydet_Click(object sender, EventArgs e)
{
    Response.Clear();
    Response.Buffer = true;
    Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    Response.AddHeader("Content-Disposition", "attachment; filename=XXXX.doc");
    Response.ContentEncoding = Encoding.GetEncoding("ISO-8859-9");
    Response.Charset = "ISO-8859-9";
    EnableViewState = false;


    StringWriter writer = new StringWriter();
    HtmlTextWriter html = new HtmlTextWriter(writer);
    form1.RenderControl(html);


    byte[] bytesInStream = Encoding.GetEncoding("iso-8859-9").GetBytes(writer.ToString());
    MemoryStream memoryStream = new MemoryStream(bytesInStream);


    string msgBody = "";
    string Email = "mail@xxxxxx.org";
    SmtpClient client = new SmtpClient("mail.xxxxx.org");
    MailMessage message = new MailMessage(Email, "mail@someone.com", "ONLINE APP FORM WITH WORD DOC", msgBody);
    Attachment att = new Attachment(memoryStream, "XXXX.doc", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
    message.Attachments.Add(att);
    message.BodyEncoding = System.Text.Encoding.UTF8;
    message.IsBodyHtml = true;
    client.Send(message);
}
Engin K.