I have an encoding problem: text in my MongoDB is wrongly encoded. The source file of the texts in my db is encoded in ISO-8859-1. When I view the text in the db, some characters are broken (they have become '�').

Currently, to retrieve the text from the db, I tried the following code.

var t = Collection.FindOne(Query.EQ("id", "2014121500892"));
string message = t["b203"].AsString;
Console.WriteLine(ChangeEncoding(message));

First attempt:

static string ChangeEncoding(string message)
{

    System.Text.Encoding srcEnc = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding destEnc = System.Text.Encoding.GetEncoding("UTF-8");
    byte[] bData = srcEnc.GetBytes(message);
    byte[] bResult = System.Text.Encoding.Convert(srcEnc, destEnc, bData);
    return destEnc.GetString(bResult);
}

Second attempt:

static string ChangeEncoding(string message)
{
    File.WriteAllText("text.txt", message, Encoding.GetEncoding("ISO-8859-1"));
    return File.ReadAllText("text.txt");
}

Sample text in db:

Box aus Pappe f�r A8-Lernk�rtchen

Desired result:

I want to be able to print it in console as:

Box aus Pappe für A8-Lernkärtchen

helb
Wylan Osorio
  • What are you viewing it with? For instance, if your IDE isn't set to that encoding, it won't know how to show that character. Byte-wise it's probably correct. – corn3lius Jan 29 '15 at 14:25
  • I just tried viewing on console screen. Console.WriteLine( result here); – Wylan Osorio Jan 29 '15 at 14:26
  • @WylanOsorio I updated the question title to be more specific. I'm sad to say that you are out of luck (see my answer) – helb Jan 29 '15 at 16:12

1 Answer

Short version

Your data is lost, and there is no general way to recover the original strings.

Longer version

What presumably happened when the data was stored: the strings were encoded as ISO-8859-1, but the resulting bytes were interpreted as UTF-8. Here's an example:

string orig = "Lernkärtchen";
byte[] iso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(orig);
// { 76, 101, 114, 110, 107, 228, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k', 'ä', 'r', 't', 'c', 'h', 'e', 'n'

When this data was passed (somehow...) to the database, which only works with Unicode strings, the bytes were decoded as UTF-8:

// Decoding ISO-8859-1 bytes as UTF-8 replaces every invalid sequence with U+FFFD.
string storedValue = Encoding.UTF8.GetString(iso88591Bytes);
byte[] dbData = Encoding.UTF8.GetBytes(storedValue);
// { 76, 101, 114, 110, 107, 239, 191, 189, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k',      '�',     'r', 't', 'c', 'h', 'e', 'n'

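The problem is that the byte 228 (11100100 binary) is not valid UTF-8 at this position: a byte of the form 1110xxxx starts a three-byte sequence, so it must be followed by two continuation bytes of the form 10xxxxxx (values 128 to 191). Here it is followed by 114 ('r'), which is not a continuation byte. For details, see UTF-8 on Wikipedia, chapter "Description".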
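As a rough illustration of that rule, the following sketch classifies a byte by its leading bits, roughly the way a UTF-8 decoder does. The class and method names (`Utf8ByteKind`, `ExpectedLength`, `IsContinuation`) are made up for this example and are not part of .NET; the check is simplified and ignores overlong and out-of-range lead bytes.

```csharp
using System;

static class Utf8ByteKind
{
    // Length of the sequence that a lead byte announces, or 0 if the byte
    // cannot start a sequence. Simplified: overlong/out-of-range leads
    // (0xC0, 0xC1, 0xF5..0xF7) are not rejected here.
    public static int ExpectedLength(byte lead)
    {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx (ASCII)
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0;                            // 10xxxxxx or 11111xxx
    }

    // Continuation bytes have the form 10xxxxxx (128..191).
    public static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    static void Main()
    {
        byte formerUmlaut = 228; // 'ä' in ISO-8859-1
        byte next = 114;         // 'r'

        Console.WriteLine(ExpectedLength(formerUmlaut)); // 3
        Console.WriteLine(IsContinuation(next));         // False
    }
}
```

The lead byte announces a three-byte sequence, but the next byte is plain ASCII, so the decoder has to give up on 228.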

So what happens is that the byte formerly known as the character 'ä' cannot be decoded into a valid Unicode character and is replaced by U+FFFD, which UTF-8 encodes as the bytes 239, 191 and 189. In binary these are 11101111, 10111111 and 10111101; their payload bits combine to the code point 1111111111111101 (0xFFFD), which is the character '�' you see in your output.

This character exists for exactly that purpose. Wikipedia's page on Unicode special characters says:
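The bit arithmetic can be checked with a few lines of C#. This is only a sketch for a well-formed three-byte sequence; `ReplacementCharDemo` and `DecodeThreeByteSequence` are hypothetical names, and no validation is performed.

```csharp
using System;

static class ReplacementCharDemo
{
    // Combines the payload bits of 1110xxxx 10xxxxxx 10xxxxxx into a code point.
    // Assumes the three bytes already form a valid sequence.
    public static int DecodeThreeByteSequence(byte b0, byte b1, byte b2)
    {
        return ((b0 & 0x0F) << 12)  // 4 payload bits from the lead byte
             | ((b1 & 0x3F) << 6)   // 6 payload bits from the first continuation byte
             | (b2 & 0x3F);         // 6 payload bits from the second continuation byte
    }

    static void Main()
    {
        int codePoint = DecodeThreeByteSequence(239, 191, 189);

        Console.WriteLine("U+" + codePoint.ToString("X4")); // U+FFFD
        Console.WriteLine(Convert.ToString(codePoint, 2));  // 1111111111111101
    }
}
```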

U+FFFD � replacement character used to replace an unknown or unrepresentable character

Try to revert that change? Good luck.
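To see why the change cannot be reverted, here is a small sketch that simulates the faulty import for two different words. `LossDemo` and `MisDecode` are illustrative names, and the assumption is that the import decoded ISO-8859-1 bytes as UTF-8, as described above.

```csharp
using System;
using System.Text;

static class LossDemo
{
    // Simulates the presumed faulty import: ISO-8859-1 bytes decoded as UTF-8.
    public static string MisDecode(string original)
    {
        byte[] latin1Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(original);
        return Encoding.UTF8.GetString(latin1Bytes);
    }

    static void Main()
    {
        string a = MisDecode("f\u00FCr"); // "für"
        string b = MisDecode("f\u00E4r"); // "fär"

        // Both umlauts collapse into the same U+FFFD, so the stored values
        // are identical and the original letter cannot be told apart.
        Console.WriteLine(a == b);            // True
        Console.WriteLine(a == "f\uFFFDr");   // True

        // Decoding the bytes with the encoding they were written in keeps the text.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string roundTrip = latin1.GetString(latin1.GetBytes("f\u00FCr"));
        Console.WriteLine(roundTrip == "f\u00FCr"); // True
    }
}
```

If the original ISO-8859-1 source file is still available, re-importing it is presumably the practical way out, for example by reading it with `File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"))` (where `path` is a placeholder for your file) before writing the strings to the database.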

Btw, Unicode and UTF-8 are awesome ♥, never use anything else ☠!

helb