Encoding issue when handling a string that contains "question mark" (�)

Question

I am parsing some web content in a response from an HttpWebRequest.

The web content uses the ISO-8859-1 charset. After parsing it and extracting the word I need from the response, I get a string containing a question mark like this �, and I want to know the right way to transform it back into a readable string.

So I tried converting the word from its current encoding to UTF-8, wondering whether UTF-8 could solve my problem:

string word = "ESPA�OL";

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");

byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);

string utfWord = utf.GetString(utfBytes);

Console.WriteLine(utfWord);

However, the utfWord variable outputs ESPA?OL, which is still wrong. The correct output should be ESPAÑOL.

Can someone please give me the right directions to solve this, if possible?

� is the substitution character, it is used by encoders to indicate that they can't recognize the *byte(s)* in the text stream. Such as incorrectly using Encoding.UTF8 on text that was encoded with the 8859-1 code page. You cannot fix it afterwards, you've lost the original byte value. It *must* be fixed at the source. — Hans Passant, Mar 17 '14 at 10:55
@HansPassant Yes, I know a little about this but not clear at all. Thanks for explanation, I was hoping to do something with the � but now I know it's not possible (also I don't have access to remote website source) so it seems that they got an encoding problem. — Oscar Jara, Mar 17 '14 at 13:49

score 5 · Accepted Answer · answered Mar 17 '14 at 10:55

The word in question is "ESPAÑOL". This can be encoded correctly in ISO-8859-1 since all characters in the word are represented in ISO-8859-1.

You can see this for yourself using the following simple program:

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}

What this tells you is that you need to diagnose where the erroneous character comes from. By the time you have a � character, it is too late: the original byte value has been lost. The presence of the � character indicates that, at some point, the bytes were decoded with an encoding that did not match them, or the text was converted into a character set that did not contain the character Ñ.
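A minimal sketch of one way the � character can arise, assuming ISO-8859-1 bytes get decoded as UTF-8 (one possibility, not something confirmed for the remote site in the question). The class and method names are hypothetical. Code points are printed in hex so that the console's own encoding does not affect what you see.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Hypothetical helper: encodes text as ISO-8859-1, then decodes those
    // bytes with the wrong decoder (UTF-8).
    public static string DecodeLatin1BytesAsUtf8(string text)
    {
        byte[] isoBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
        return Encoding.UTF8.GetString(isoBytes);
    }

    // Formats each UTF-16 code unit as four hex digits, separated by spaces.
    public static string ToHex(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // \u00D1 is Ñ. Its ISO-8859-1 byte 0xD1 followed by 'O' is not valid UTF-8,
        // so the UTF-8 decoder substitutes U+FFFD for it.
        string damaged = DecodeLatin1BytesAsUtf8("ESPA\u00D1OL");
        Console.WriteLine(ToHex(damaged));

        // Re-encoding cannot bring Ñ back: U+FFFD has no ISO-8859-1 byte.
        byte[] reencoded = Encoding.GetEncoding("ISO-8859-1").GetBytes(damaged);
        Console.WriteLine(BitConverter.ToString(reencoded));
    }
}
```

The fifth code unit should come out as FFFD. Re-encoding the damaged string typically turns that position into 3F ('?'), which is consistent with the ESPA?OL output in the question: nothing in the string records that the byte was originally 0xD1.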

A conversion from ISO-8859-1 to a Unicode encoding will correctly handle "ESPAÑOL" because that word can be encoded in ISO-8859-1.
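As a rough illustration of that point, the sketch below starts from the original ISO-8859-1 bytes (not from an already damaged string) and converts them to UTF-8 with Encoding.Convert. The class and helper names are hypothetical, and the byte array is simply "ESPAÑOL" written out by hand.

```csharp
using System;
using System.Text;

public static class Latin1ToUtf8Demo
{
    // Hypothetical helper: re-encodes ISO-8859-1 bytes as UTF-8 bytes.
    public static byte[] Latin1BytesToUtf8(byte[] isoBytes)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        return Encoding.Convert(iso, Encoding.UTF8, isoBytes);
    }

    public static void Main()
    {
        // "ESPAÑOL" as ISO-8859-1 bytes; 0xD1 is Ñ.
        byte[] isoBytes = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };
        byte[] utfBytes = Latin1BytesToUtf8(isoBytes);

        // In UTF-8, Ñ becomes the two-byte sequence C3 91.
        Console.WriteLine(BitConverter.ToString(utfBytes));

        // Decoding the UTF-8 bytes gives back the original word.
        Console.WriteLine(Encoding.UTF8.GetString(utfBytes) == "ESPA\u00D1OL");
    }
}
```

The difference from the code in the question is the starting point: here the bytes still carry 0xD1, whereas the question's string already contained � before any conversion was attempted.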

The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
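If the faulty step turns out to be in your own reading code and not on the remote site (an assumption; the thread does not establish which), one place to check is the encoding passed to the reader that consumes the response stream. The sketch below uses a MemoryStream as a stand-in for the stream returned by HttpWebResponse.GetResponseStream(); the class and helper names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseDecoding
{
    // Hypothetical helper: reads a whole stream as text using the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Stand-in for the response body: "ESPAÑOL" as ISO-8859-1 bytes.
        byte[] body = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");

        // Decoding with the same encoding the bytes were written in keeps Ñ intact.
        string text = ReadAll(new MemoryStream(body), iso);
        Console.WriteLine(text == "ESPA\u00D1OL");
    }
}
```

If the bytes arriving from the server already contain a substituted character, no choice of encoding on the reading side will restore Ñ, and the fix has to happen upstream.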

Thanks for explanation, finally got it. However, I don't have access to source code (from remote website) so they clearly have a problem with encoding. — Oscar Jara, Mar 17 '14 at 13:46
My original string is "Møhan", It shows as "M�han" but after converting using your code it become "M?han" - How do I get original "Møhan" — kudlatiger, Apr 18 '20 at 04:45
