Convert and correct corrupt UTF-8 text into ANSI?

Question

I am not a professional developer, and I am having trouble converting Unicode text to ANSI. The text is displayed in a legacy application that doesn't support Unicode.

Here's a sample of what a Unicode-encoded text looks like when displayed in that legacy application:

Ã€ chaque journÃ©e des quatre jours de colloque, entre 250 et 500 personnes sont venues assister en continu aux discussions de cette rencontre. Cette affluence, ainsi que la richesse et la variÃ©tÃ© des discussions engagÃ©es lors de ces confÃ©rences, confirment la nÃ©cessitÃ© d'un espace ouvert pour les pensÃ©es critiques dans le monde francophone, Ã l'universitÃ© et bien au-delÃ .

I notice the following things:

All diacritic characters are encoded as C3 ("Ã") + a second byte
The character "à" is wrongly encoded as C320 ("Ã ")
Windows' CharacterMap application says that "é" is "U+00E9" while the document contains C3A9 instead.

I have a couple of questions:

Why the difference between the document and CharacterMap: is the document encoded in something other than Unicode? For instance, why is é encoded as C3A9 instead of 00E9?

I use the following VB.Net code to convert the document from Unicode to ANSI. How can I replace all occurrences of C320 with à?

Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")
Dim Str As String
Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Clipboard.GetText)))
Clipboard.SetText(Str)

UTF-8 and Unicode are different. UTF-8 can encode a character in up to 4 bytes, and Unicode is 2 bytes. So if it is UTF-8, some characters may be 3 or 4 bytes. — γηράσκω δ' αεί πολλά διδασκόμε, Feb 27 '14 at 12:45
Most probably. To be sure though, check the actual bytes. If you see a byte with a value larger than E0, then it is UTF-8. If E0 <= byte value <= EF, the character is encoded with 3 bytes; if the byte value >= F0, with 4 bytes. — γηράσκω δ' αεί πολλά διδασκόμε, Feb 27 '14 at 13:08
`C3A9` is the UTF-8 byte sequence to encode `U+00E9` Unicode code point. Unicode is a character set that maps human signs to abstract numeric *code points*. On the other hand, UTF-8 is a way to encode these *code points* into *byte sequences*. See [this related question](http://stackoverflow.com/questions/643694/utf-8-vs-unicode) for more about this topic. — RandomSeed, Feb 27 '14 at 13:40
Perhaps you could post the solution that worked as an answer, and accept it straight away (so that the question does not appear as unresolved anymore) — RandomSeed, Feb 27 '14 at 13:42
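
To make the distinction in the comments concrete, here is a minimal sketch that prints the UTF-8 byte sequence for a few code points. The module name `Utf8BytesDemo` and the helper `Utf8Hex` are made up for this illustration; the calls it uses (`Char.ConvertFromUtf32`, `Encoding.UTF8.GetBytes`, `BitConverter.ToString`) are standard .NET.

```vbnet
Imports System
Imports System.Text

Module Utf8BytesDemo
    ' Hypothetical helper: returns the UTF-8 bytes of one code point
    ' as a hex string such as "C3-A9".
    Function Utf8Hex(codePoint As Integer) As String
        Dim text As String = Char.ConvertFromUtf32(codePoint)
        Return BitConverter.ToString(Encoding.UTF8.GetBytes(text))
    End Function

    Sub Main()
        Console.WriteLine(Utf8Hex(&HE9))     ' U+00E9 (é)  -> C3-A9, 2 bytes
        Console.WriteLine(Utf8Hex(&HE0))     ' U+00E0 (à)  -> C3-A0, 2 bytes
        Console.WriteLine(Utf8Hex(&HC0))     ' U+00C0 (À)  -> C3-80, 2 bytes
        Console.WriteLine(Utf8Hex(&H20AC))   ' U+20AC (€)  -> E2-82-AC, 3 bytes
        Console.WriteLine(Utf8Hex(&H1F600))  ' U+1F600     -> F0-9F-98-80, 4 bytes
    End Sub
End Module
```

An application that reads these bytes as windows-1252 shows each byte as its own character: C3 becomes "Ã", A9 becomes "©", 80 becomes "€", and A0 becomes a no-break space. That would explain why the sample shows "Ã©" for é and "Ã€" for À.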

score 0 · Answer 1 · edited Mar 20 '17 at 09:43

(Answered in a question edit. Converted to a community wiki answer. See "What is the appropriate action when the answer to a question is added to the question itself?")

The OP wrote:

For others' benefit, problem solved using the following code:
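' Requires Imports System.Text and Imports System.Windows.Forms.
Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")

Dim Str As String
Str = Clipboard.GetText
' Restore the no-break space (U+00A0) that had been turned into a plain space.
Str = Str.Replace("Ã ", "Ã" & ChrW(160))
' On .NET Framework, Encoding.Default is the system ANSI code page.
Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Str)))
Clipboard.SetText(Str)
MessageBox.Show(Str)
In the Str.Replace() call above, the second character of the search string is an ordinary space (hex 20), while the second character of the replacement is a no-break space (decimal 160, hex A0).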
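
The same repair can probably be written as a single round trip: encode the garbled string back to its raw bytes with windows-1252, then decode those bytes as UTF-8. The sketch below is not the OP's code. `MojibakeRepair` and `RepairMojibake` are names invented here, and it assumes the garbled text was produced by reading UTF-8 bytes as windows-1252. It also names windows-1252 explicitly instead of relying on `Encoding.Default`, which is UTF-8 on .NET Core and later.

```vbnet
Imports System
Imports System.Text

Module MojibakeRepair
    ' Hypothetical helper: undoes "UTF-8 bytes read as windows-1252".
    Function RepairMojibake(garbled As String) As String
        Dim cp1252 As Encoding = Encoding.GetEncoding("windows-1252")
        ' Put back the no-break space (byte A0) that was flattened to a space.
        Dim restored As String = garbled.Replace("Ã ", "Ã" & ChrW(&HA0))
        ' Recover the original bytes, then decode them as UTF-8.
        Dim rawBytes As Byte() = cp1252.GetBytes(restored)
        Return Encoding.UTF8.GetString(rawBytes)
    End Function

    Sub Main()
        ' On .NET Core / .NET 5+ only, windows-1252 must be registered first:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Dim sample As String = "confÃ©rences, bien au-delÃ ."
        Console.WriteLine(RepairMojibake(sample))
    End Sub
End Module
```

For the sample string, the function should return "conférences, bien au-delà.". Unlike the code quoted above, it does not pass the result through windows-1252 a second time, so any character that windows-1252 cannot represent survives in the .NET string and is only lost when the text is handed to the legacy application.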
