I would like to convert a string variable from UTF-8 to ISO-8859-1, because special characters like ä, ö, ü are displayed as ? in C#. To achieve this, I found this post, but it does not work for me. I have tried to find out why.

I inspected the bytes of the original and the converted string in C# with this code:

System.IO.MemoryStream stream = new System.IO.MemoryStream();
System.Runtime.Serialization.IFormatter formatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
formatter.Serialize(stream, dt2.Rows[0][0]); // I read my string from a DataTable and it is UTF-8 encoded
byte[] bytes = stream.GetBuffer();

This line of code:

Console.WriteLine(BitConverter.ToString(bytes).Replace("-", ""));

returns:

4652495343484BEFBFBD53455A55424552454954554E47454E2020

Now, I would like to encode to ISO-8859-1. For this, I use this code:

var srcEncoding = Encoding.Default;   // The original bytes are utf8 hence here "Default"
var destEncoding = Encoding.GetEncoding("ISO-8859-1");
var destBytes = Encoding.Convert(srcEncoding, destEncoding, bytes);

and then run the line of code:

Console.WriteLine(BitConverter.ToString(destBytes).Replace("-", ""));

I get the same hex code, so it seems that the conversion doesn't work properly:

4652495343484BEFBFBD53455A55424552454954554E47454E2020

Do you have any idea why the conversion doesn't work for me?

marc_s
Kaja
  • Eight-bit ASCII contains characters 0x00 to 0x7F, which are displayed the same under all encodings. Characters 0x80 to 0xFF are displayed differently depending on the encoding used. So there is no conversion needed. The bytes are the same for all 8-bit encodings. – jdweng May 12 '19 at 15:55
  • What should I do then to get the right characters like ä, ö, ü? If I do no conversion, I still see ? or other things for these characters. – Kaja May 12 '19 at 16:02
  • @jdweng: There's a multibyte sequence thrown in there, `EFBFBD`, which appears to be a valid UTF-8 encoding of `\uFFFD`. – Ben Voigt May 12 '19 at 16:02
  • @Kaja: Have you tried with `srcEncoding = Encoding.UTF8`? Using `Encoding.Default` just makes it harder to tell if the code is correct. – Ben Voigt May 12 '19 at 16:03
  • @BenVoigt Yes, but it doesn't help; I still see ? instead of ä. – Kaja May 12 '19 at 16:04
  • If I change it to UTF-8, as you said, I get this hex: `4652495343484B3F53455A55424552454954554E47454E2020`. There is no ä in it. – Kaja May 12 '19 at 16:06
  • Your input string doesn't contain a-umlaut. It contains the ["Unicode replacement character"](https://www.fileformat.info/info/unicode/char/fffd/index.htm). Whatever conversion happened before the data was stored has already lost your a-umlaut. – Ben Voigt May 12 '19 at 16:07
  • @BenVoigt I am sure I have ä in my text. My string actually comes from an SQLite database. In DB Browser, if I set the encoding of this column to `CP850`, I see ä. – Kaja May 12 '19 at 16:08
  • @Ben Voigt: 16-bit is Unicode, not an 8-bit encoding. – jdweng May 12 '19 at 16:08
  • @Kaja: Then your database reading code is wrong. But `bytes` does not contain a-umlaut. – Ben Voigt May 12 '19 at 16:09
  • @jdweng: I don't understand what you are trying to say. His `byte[] bytes` contains the UTF-8 encoding of a bunch of ASCII with one UTF-8 encoding of the "Unicode replacement character" mixed in. – Ben Voigt May 12 '19 at 16:11
  • You usually can't change the encoding once bytes have been decoded to a string. The encoding will remove characters that are not valid (like ASCII encoding removes non-printable characters). Usually characters aren't displayed properly because of the method you are using to view the string. For example, the console encoding defaults to the language in the computer settings and may not show them properly. The font being used may also not display the characters properly. – jdweng May 12 '19 at 16:14
  • Why would you want to convert? UTF-8 can store every possible character. And if it doesn't print correctly, then it's the fault of the console or the printing function. This is an XY problem. See [Outputting a Unicode character in C#](https://stackoverflow.com/q/3162116/995714), [How to write Unicode characters to the console?](https://stackoverflow.com/q/5750203/995714), [c# how to output Unicode characters?](https://stackoverflow.com/q/40364627/995714) – phuclv May 12 '19 at 16:43
  • @phuclv: He doesn't have the right UTF-8 string to begin with. What he does have is stored in the database in CP850 and then turned into a mess by the database access library. – Ben Voigt May 12 '19 at 16:50

2 Answers

Your string doesn't contain a-umlaut.

It contains "Unicode replacement character".

Whatever conversion happened before you got byte[] bytes has already lost your a-umlaut.
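One way to confirm this for yourself is to decode the bytes as UTF-8 and look for U+FFFD. The following is only a minimal sketch: `ReplacementCheck` and `ContainsReplacementChar` are hypothetical names chosen for illustration, and the byte values are the ones printed in the question.

```csharp
using System;
using System.Text;

public static class ReplacementCheck
{
    // Hypothetical helper: returns true if the decoded text contains U+FFFD.
    // Note that Encoding.UTF8 also produces U+FFFD for invalid byte sequences,
    // so "true" means the original character is gone either way.
    public static bool ContainsReplacementChar(byte[] utf8Bytes)
    {
        string text = Encoding.UTF8.GetString(utf8Bytes);
        return text.IndexOf('\uFFFD') >= 0;
    }

    public static void Main()
    {
        // The bytes from the question: "FRISCHK", then EF BF BD, then "SEZUBEREITUNGEN  ".
        byte[] bytes = {
            0x46, 0x52, 0x49, 0x53, 0x43, 0x48, 0x4B, 0xEF, 0xBF, 0xBD,
            0x53, 0x45, 0x5A, 0x55, 0x42, 0x45, 0x52, 0x45, 0x49, 0x54,
            0x55, 0x4E, 0x47, 0x45, 0x4E, 0x20, 0x20 };

        Console.WriteLine(ContainsReplacementChar(bytes)); // prints "True"
    }
}
```

If this reports `True`, no later call to `Encoding.Convert` can bring the a-umlaut back; the fix has to happen where the bytes are read.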

Ben Voigt
  • Please see my last post [here](https://social.msdn.microsoft.com/Forums/vstudio/en-US/681a2c71-8417-484a-bf63-58d6a2d05d79/retrieving-special-characters-from-sqlite-in-c?forum=csharpgeneral). I can see ä. – Kaja May 12 '19 at 16:10
  • If I run the query `SELECT hex(c1) FROM t1 where c2=338`, then I get this hex: `4652495343484B8E53455A55424552454954554E47454E2020` – Kaja May 12 '19 at 16:13
  • @Kaja: Then the first four lines of code for retrieving the "original" are not working, because the result of those four lines has the "Unicode Replacement Character", not a-umlaut. – Ben Voigt May 12 '19 at 16:13
  • Do you have any idea how I can get the original bytes in C#? – Kaja May 12 '19 at 16:14
  • Do you see a difference between that string and the one in your question? Because I do. They aren't even the same length, and your SELECT query result doesn't contain `EFBFBD`, which is the UTF-8 encoding of the replacement character. – Ben Voigt May 12 '19 at 16:15
  • You are right. Is it a bug in C# that I cannot get the original bytes? – Kaja May 12 '19 at 16:16
  • @Kaja: That's a database question, not a string conversion question. It might be a bug in the SQLite library you are using. I have no idea. – Ben Voigt May 12 '19 at 16:17
  • My last question: what about this one? Can I get the right string with ä from this hex: `A4652495343484BC28E53455A55424552454954554E47454E2020` – Kaja May 12 '19 at 16:24
  • @Kaja: Is that in a byte array? Try `Encoding.GetEncoding("ISO-8859-1").GetString(array)`. If you have a string containing the hex representation, first [use `SoapHexBinary.Parse` as shown here](https://stackoverflow.com/a/2556329/103167) to get the array. – Ben Voigt May 12 '19 at 16:27
  • Actually, I have changed the connection string. Originally I had this connection string: `Data Source=C:\Test\test.001;UseUTF8Encoding=True`. Then I changed it to `Data Source=C:\Test\test.001;UseUTF16Encoding=True;Synchronous=Normal;New=False`. As you can see, I have changed UTF-8 to UTF-16, and the bytes you see in my last comment are from UTF-16. – Kaja May 12 '19 at 16:31
  • I'm positive those aren't UTF-16. They look like raw CP850. – Ben Voigt May 12 '19 at 16:33
  • But I still can't see ä in my string. :( I see FRISCHK?SEZUBEREITUNGEN in the console without any conversion. – Kaja May 12 '19 at 16:34
  • Did some more research: CP850 is not compatible with ISO-8859-1. Trying to figure out the C# name for CP850. Might be just `Encoding.GetEncoding(850)`. – Ben Voigt May 12 '19 at 16:36
  • Sorry for disturbing, but if I get the bytes in CP850, don't I need no conversion at all? I thought CP850 or 850 can show ä. – Kaja May 12 '19 at 16:40
  • @Kaja: I think I figured it out. Your hex bytes are not "raw CP850", they are a UTF-8 transformation of CP850. So you need to undo the UTF-8, then decode as CP850. Demonstration: https://rextester.com/CUZBR49827 – Ben Voigt May 12 '19 at 16:46
  • Your earlier string, the one straight from `SELECT`, was actually a bit easier to deal with since it was real raw CP850, with no abused UTF-8 layer on top. – Ben Voigt May 12 '19 at 16:51
  • God save you! Now I can see Ä, but I also see some irrelevant characters: ` ???? FRISCHKÄSEZUBEREITUNGEN ` Do you have any idea why? – Kaja May 12 '19 at 16:53
  • The string you gave me most recently seems to start with `0A`, which is a newline character. The result of the SELECT doesn't have that, nor the `C28E`, which should be just `8E`. – Ben Voigt May 12 '19 at 17:17
  • Thank you so much. In general I now know what I should do. Hopefully I can achieve my aim :) – Kaja May 12 '19 at 18:03
  • What if I would like to work with the original hex in the database: `4652495343484B8E53455A55424552454954554E47454E2020`? As I understand it, this is CP850, which contains ä. Correct? That means if I would like to show ä in the console, I should only convert it to ISO-8859-1 and I don't need to convert it to UTF-8 first. Is that correct? – Kaja May 12 '19 at 18:17
  • For that string, just call `Encoding.GetEncoding(850).GetString(array)` immediately. No UTF-8 and certainly no ISO-8859-1. – Ben Voigt May 12 '19 at 19:44
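
The two decoding routes discussed in the comments above can be sketched as follows. This is an illustration under stated assumptions, not the code behind the rextester link: `Cp850Demo`, `FromHex`, `DecodeCp850` and `DecodeUtf8WrappedCp850` are hypothetical names, the "wrapped" hex is the second string from the comments with its stray leading character left off, and the `RegisterProvider` call assumes .NET Core 3.0+ or .NET 5+, where code page 850 is not available by default (on .NET Framework that line should not be needed).

```csharp
using System;
using System.Text;

public static class Cp850Demo
{
    static Cp850Demo()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 850 must be registered before GetEncoding(850) works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: turns hex text such as "4652..." into bytes.
    public static byte[] FromHex(string hex)
    {
        var result = new byte[hex.Length / 2];
        for (int i = 0; i < result.Length; i++)
            result[i] = Convert.ToByte(hex.Substring(2 * i, 2), 16);
        return result;
    }

    // Route 1: the bytes are raw CP850, so decode them directly.
    public static string DecodeCp850(byte[] raw)
    {
        return Encoding.GetEncoding(850).GetString(raw);
    }

    // Route 2: the CP850 bytes were wrapped in UTF-8 (8E became C2 8E).
    // Undo the UTF-8 layer, map each char back to one byte, then decode as CP850.
    public static string DecodeUtf8WrappedCp850(byte[] wrapped)
    {
        string asChars = Encoding.UTF8.GetString(wrapped);
        byte[] raw = Encoding.GetEncoding("ISO-8859-1").GetBytes(asChars);
        return Encoding.GetEncoding(850).GetString(raw);
    }

    public static void Main()
    {
        // Hex returned by SELECT hex(c1) in the comments above.
        string rawHex = "4652495343484B8E53455A55424552454954554E47454E2020";
        // Same text with C28E in place of 8E (leading stray character omitted).
        string wrappedHex = "4652495343484BC28E53455A55424552454954554E47454E2020";

        Console.WriteLine(DecodeCp850(FromHex(rawHex)));
        Console.WriteLine(DecodeUtf8WrappedCp850(FromHex(wrappedHex)));
    }
}
```

Both calls should yield the same string with Ä in it; whether the console then renders Ä correctly depends on `Console.OutputEncoding` and the console font.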

There is no reason to mess with MemoryStreams and BinaryFormatters. Just use the methods GetString and GetBytes of the appropriate Encoding.

byte[] oldBytes = new byte[] { 0x46, 0x52, 0x49, 0x53, 0x43, 0x48,
    0x4B, 0xEF, 0xBF, 0xBD, 0x53, 0x45, 0x5A, 0x55, 0x42, 0x45, 0x52,
    0x45, 0x49, 0x54, 0x55, 0x4E, 0x47, 0x45, 0x4E, 0x20, 0x20 };
Console.WriteLine($"oldBytes: {BitConverter.ToString(oldBytes)} ({oldBytes.Length})");

string oldStr = Encoding.UTF8.GetString(oldBytes);
Console.WriteLine($"oldStr: <{oldStr}>");

byte[] newBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(oldStr);
Console.WriteLine($"newBytes: {BitConverter.ToString(newBytes)} ({newBytes.Length})");

string newStr = Encoding.GetEncoding("ISO-8859-1").GetString(newBytes);
Console.WriteLine($"newStr: <{newStr}>");

Output:

oldBytes: 46-52-49-53-43-48-4B-EF-BF-BD-53-45-5A-55-42-45-52-45-49-54-55-4E-47-45-4E-20-20 (27)
oldStr: <FRISCHK�SEZUBEREITUNGEN  >  
newBytes: 46-52-49-53-43-48-4B-3F-53-45-5A-55-42-45-52-45-49-54-55-4E-47-45-4E-20-20 (25)
newStr: <FRISCHK?SEZUBEREITUNGEN  >  
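
The `3F` in `newBytes` appears because ISO-8859-1 cannot represent U+FFFD, so the encoder quietly substitutes `?`. If you would rather have the loss reported than hidden, one option is an encoding with exception fallbacks. This is a minimal sketch; `StrictLatin1` and `TryEncodeLatin1` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class StrictLatin1
{
    // Hypothetical helper: encodes to ISO-8859-1, but reports failure instead
    // of replacing unrepresentable characters with '?'.
    public static bool TryEncodeLatin1(string text, out byte[] bytes)
    {
        Encoding strict = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            bytes = strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }

    public static void Main()
    {
        byte[] result;
        // U+FFFD has no ISO-8859-1 byte, so this reports False.
        Console.WriteLine(TryEncodeLatin1("FRISCHK\uFFFDSE", out result));
        // Ä (U+00C4) is representable, so this reports True.
        Console.WriteLine(TryEncodeLatin1("FRISCHK\u00C4SE", out result));
    }
}
```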
Theodor Zoulias
  • "Garbage in, garbage out". Your `oldBytes` array contains "Unicode replacement character" garbage. – Ben Voigt May 12 '19 at 19:46
  • @BenVoigt Yep, the bytes are most probably corrupted. The [replacement character](https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character) is a good indication that something bad happened earlier. I mainly wanted to show a simpler API for converting text to bytes and vice versa. – Theodor Zoulias May 12 '19 at 21:45