I was asked to solve an encoding problem in a file. The file was expected to be in UTF-8, but it was actually in extended ASCII.

The result is a file with cases like this:

BrasÃ­lia; EletrÃ´nicos e InformÃ¡tica CÃ¢meras e AcessÃ³rios mÃºsica

when it should actually be:

Brasília Eletrônicos e Informática Câmeras e Acessórios música

I solved it with this code:

// Requires: using System.Collections.Generic;

// Replaces each known mojibake sequence with the character it was meant to be.
// Sequences that are not listed in encodingErrosDic are left untouched.
private static string FixEncodingIssues(string str)
{
    string fixedStr = str;

    foreach (KeyValuePair<string, string> pair in encodingErrosDic)
        fixedStr = fixedStr.Replace(pair.Key, pair.Value);

    return fixedStr;
}

// Key: the UTF-8 bytes of a character misread as Windows-1252.
// Value: the character that was intended.
private static readonly Dictionary<string, string> encodingErrosDic = new Dictionary<string, string>()
{
    { "Ãƒ", "Ã" },
    { "Ã\x81", "Á" },
    { "Ã€", "À" },
    { "Ã‚", "Â" },
    { "Ã„", "Ä" },
    { "Ã…", "Å" },
    { "Ã‡", "Ç" },
    { "Ãˆ", "È" },
    { "Ã‰", "É" },
    { "ÃŠ", "Ê" },
    { "Ã‹", "Ë" },
    { "ÃŒ", "Ì" },
    { "Ã\x8D", "Í" },
    { "ÃŽ", "Î" },
    { "Ã\x8F", "Ï" },
    { "Ã\x90", "Ð" },
    { "Ã‘", "Ñ" },
    { "Ã’", "Ò" },
    { "Ã“", "Ó" },
    { "Ã”", "Ô" },
    { "Ã•", "Õ" },
    { "Ã–", "Ö" },
    { "Ã\u2014", "×" }, // \u2014 is the dash that byte 0x97 maps to in Windows-1252
    { "Ã˜", "Ø" },
    { "Ã™", "Ù" },
    { "Ãš", "Ú" },
    { "Ã›", "Û" },
    { "Ãœ", "Ü" },
    { "Ã\x9D", "Ý" },
    { "Ã\xA0", "à" },
    { "Ã¡", "á" },
    { "Ã¢", "â" },
    { "Ã£", "ã" },
    { "Ã¤", "ä" },
    { "Ã¥", "å" },
    { "Ã¦", "æ" },
    { "Ã§", "ç" },
    { "Ã¨", "è" },
    { "Ã©", "é" },
    { "Ãª", "ê" },
    { "Ã«", "ë" },
    { "Ã¬", "ì" },
    { "Ã®", "î" },
    { "Ã¯", "ï" },
    { "Ã\xAD", "í" },
    { "Ã°", "ð" },
    { "Ã±", "ñ" },
    { "Ã²", "ò" },
    { "Ã³", "ó" },
    { "Ã´", "ô" },
    { "Ãµ", "õ" },
    { "Ã¶", "ö" },
    { "Ã¸", "ø" },
    { "Ã¹", "ù" },
    { "Ãº", "ú" },
    { "Ã»", "û" },
    { "Ã¼", "ü" },
    { "Ã½", "ý" }
};

I would like to know if there is a nicer way to solve this issue. My solution feels too rough: it won't work for any byte sequence that is not listed in the dictionary. Is there a cleaner solution that doesn't involve listing every extended character and replacing it with its UTF-8 equivalent?

Carlos Siestrup
  • When you say "extended ascii", what encoding is it actually in? "Extended ascii" just means "using 8 bits for the char codes" without saying anything about what characters the upper 128 character codes map to. Is it an ANSI code page? – Matthew Watson Oct 15 '20 at 14:01
  • Actually, it looks like your input file *is* encoded in UTF-8, but incorrectly interpreted as being in Windows-1252, leading to the mojibake. `Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("BrasÃ­lia; EletrÃ´nicos e InformÃ¡tica CÃ¢meras e AcessÃ³rios mÃºsica"))` gives the expected result, suggesting that the code reading the file is doing it wrong, not the encoder. This often happens when relying on the "default" code page for reading (many people assume this to be UTF-8, but it actually almost never is, and depends on the system). – Jeroen Mostert Oct 15 '20 at 14:06
  • See the [Wikipedia article on Extended ASCII](https://en.wikipedia.org/wiki/Extended_ASCII) for information about this poorly-defined name. What you're seeing is likely ISO 8859-1 or the Windows code page 1252 superset. See [C# Convert string from UTF-8 to ISO-8859-1 (Latin1) H](https://stackoverflow.com/q/1922199/215552) and reverse the steps. – Heretic Monkey Oct 15 '20 at 14:07
  • Are you using .Net Core or .Net Framework? That will make a difference when working with ANSI code pages. – Matthew Watson Oct 15 '20 at 14:14
  • I'm using .NET Framework. @JeroenMostert's solution worked. Thanks! – Carlos Siestrup Oct 15 '20 at 14:24
  • Note that for .Net Core, `Encoding.Default` is always UTF8 - this is different from .Net Framework (I should think they changed it to avoid issues like the one above!). – Matthew Watson Oct 15 '20 at 14:26
  • Note that my line of code was *not* intended to be used as a solution after data has already been corrupted -- this is not going to work for all possible characters! You should make sure to identify the place where the encoding is being applied incorrectly (for example, a `File.ReadAllText` that fails to specify an `Encoding` and is relying on a possibly missing byte order mark), and fix it there. – Jeroen Mostert Oct 15 '20 at 14:27
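
The two ideas from the comments above can be sketched roughly as follows. This is only an illustration, assuming the file really is UTF-8 and was decoded as Windows-1252 somewhere along the way. The class name `MojibakeRepair` and the method names `ReadUtf8` and `FixUtf8ReadAsWindows1252` are hypothetical, chosen for this example. `ReadUtf8` is the preferred fix (state the encoding where the file is read). `FixUtf8ReadAsWindows1252` is the round trip from the comment, and as noted there it may not recover every possible character once the text has already been corrupted.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, not part of the original question.
public static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Needed on .NET Core / .NET 5+, where code page 1252 is not registered
        // by default. On .NET Framework 1252 is built in, so this line can be
        // dropped there (CodePagesEncodingProvider then comes from the
        // System.Text.Encoding.CodePages NuGet package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Preferred fix: read the file with an explicit encoding instead of
    // relying on the system default code page.
    public static string ReadUtf8(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }

    // Fallback for text that was already decoded with the wrong encoding:
    // turn the characters back into the bytes they came from (Windows-1252),
    // then decode those bytes as UTF-8.
    public static string FixUtf8ReadAsWindows1252(string garbled)
    {
        Encoding windows1252 = Encoding.GetEncoding(1252);
        byte[] originalBytes = windows1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

For example, `MojibakeRepair.FixUtf8ReadAsWindows1252("m\u00C3\u00BAsica")` returns `"música"`, with no dictionary of special cases involved. Text that was already correct should not be passed through the fallback, since re-decoding it would corrupt it.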
