
I am getting a string from a third-party program that I don't control, and my part of the code outputs it as HTML. This works fine in English, but text in other languages is displayed incorrectly. For example, accented characters in Spanish come out garbled, and characters in East Asian languages (e.g. Korean) are garbled even more. I am pretty sure I need to do some encoding work so that all languages display correctly.

My understanding of encoding is fairly poor, so before posting the real question, which I suspect is "How do I encode this to UTF-8 in C#?", I would like to understand the matter better by asking simpler questions first.

My question here is: how do I find out which encoding my input string uses? In Spanish, a word with an accent looks like this: "AcciÃ³n" instead of "Acción". Is this ANSI, or what am I dealing with?

Thanks a lot in advance!

Gaara
    It is pretty much impossible to tell just from the byte stream. You need to ask the makers of the third party program what encoding it outputs in and read using the same encoding. Chances are (from your description) that this is a Unicode encoding. – Oded Dec 21 '12 at 15:52

1 Answer


I get an accent: "AcciÃ³n"

The presence of the Ã character is a dead giveaway. In Latin-1 and Windows-1252, the accented capital A characters have character codes 0xC0 and up (Ã is 0xC3), and bytes in that range often appear as the first byte of a two-byte UTF-8 sequence. The ó glyph is code point U+00F3, and its UTF-8 encoding is the byte pair 0xC3 0xB3. Read as single-byte characters, those two bytes are Ã and ³.
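
To make the byte arithmetic concrete, here is a minimal sketch that reproduces the garbling. The class and method names are made up for illustration, and ISO-8859-1 (Latin-1) stands in for whatever 8-bit code page is actually in use on the machine, which is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of any library.
public static class MojibakeDemo
{
    // Returns the UTF-8 bytes of the text as hex, e.g. "C3-B3".
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    // Encodes the text as UTF-8, then decodes those bytes with a
    // single-byte encoding (Latin-1 assumed here), which is the
    // mistake that produces the garbled output.
    public static string MisreadUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "\u00F3" is ó; escapes avoid depending on the source file encoding.
        Console.WriteLine(Utf8Hex("\u00F3")); // prints C3-B3

        // The result is "Acci" + U+00C3 (Ã) + U+00B3 (³) + "n".
        string garbled = MisreadUtf8AsLatin1("Acci\u00F3n");
        foreach (char c in garbled)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

The code points are printed rather than the string itself because the console may apply yet another encoding to its output.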

The strings are encoded in UTF-8, but you are reading them with an 8-bit encoding such as Encoding.Default.
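
A possible way to act on this, sketched under assumptions: if you can get at the raw bytes from the third-party program, decode them with Encoding.UTF8. If you only have the already garbled string, you may be able to undo the damage by re-encoding it with the encoding that was wrongly applied. The names below are hypothetical, and Latin-1 is assumed as the wrong encoding. If the real one was Windows-1252 or another code page, the round trip may not be lossless.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of any library.
public static class Utf8Repair
{
    // Preferred fix: decode the raw bytes as UTF-8 in the first place.
    public static string DecodeUtf8(byte[] rawBytes)
    {
        return Encoding.UTF8.GetString(rawBytes);
    }

    // Fallback: the string was already decoded with Latin-1 (assumed).
    // Turn it back into the original bytes, then decode those as UTF-8.
    public static string RepairLatin1Mojibake(string garbled)
    {
        byte[] originalBytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        // 0x41 is "A"; 0xC3 0xB3 is the UTF-8 encoding of ó.
        string decoded = DecodeUtf8(new byte[] { 0x41, 0xC3, 0xB3 });
        Console.WriteLine(decoded == "A\u00F3");

        // "Acci" + Ã + ³ + "n" is repaired to "Acción".
        string repaired = RepairLatin1Mojibake("Acci\u00C3\u00B3n");
        Console.WriteLine(repaired == "Acci\u00F3n");
    }
}
```

Decoding correctly at the point where the bytes are read is the more robust option; the repair function is only a workaround for when that point is out of reach.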

Hans Passant
    Thanks a lot, Hans. This totally answers the question. Do you know how I can store this in a String as UTF-8 in C#? Would you suggest I post this as a new question? – Gaara Dec 21 '12 at 18:58