I am not a professional developer, and I am having trouble converting Unicode text to ANSI for a legacy application that doesn't support Unicode.

Here's a sample of what Unicode-encoded text looks like when displayed in that legacy application:

À chaque journée des quatre jours de colloque, entre 250 et 500 personnes sont venues assister en continu aux discussions de cette rencontre. Cette affluence, ainsi que la richesse et la variété des discussions engagées lors de ces conférences, confirment la nécessité d'un espace ouvert pour les pensées critiques dans le monde francophone, à l'université et bien au-delà .

I notice the following things:

  • All diacritic characters are encoded as C3 ("Ã") + a second byte
  • The character "à" is wrongly encoded as C320 ("Ã ")
  • Windows' CharacterMap application says that "é" is "U+00E9" while the document contains C3A9 instead.

I have a couple of questions:

  1. Why the difference between the document and CharacterMap: Is the document encoded in something other than Unicode? For instance, why is é encoded as C3A9 instead of 00E9?

  2. I use the following VB.Net code to convert the document from Unicode to Ansi: How can I replace all occurrences of C320 with à?

    Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
    Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")
    Dim Str As String
    Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Clipboard.GetText)))
    Clipboard.SetText(Str)
    

1 Answer

(Answered in a question edit and converted to a community wiki answer. See "What is the appropriate action when the answer to a question is added to the question itself?")

The OP wrote:

For others' benefit, problem solved using the following code:

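    ' Needs Imports System.Text (Clipboard and MessageBox come from System.Windows.Forms)
    Dim Encw1252 As Encoding = Encoding.GetEncoding("windows-1252")
    Dim EncUTF8 As Encoding = Encoding.GetEncoding("utf-8")

    Dim Str As String
    Str = Clipboard.GetText
    ' Restore the no-break space (character 160) that arrives as a plain space after "Ã"
    Str = Str.Replace("Ã ", "Ã" & ChrW(160))
    ' Recover the raw bytes, convert them from UTF-8 to Windows-1252, and decode the result
    Str = Encw1252.GetString(Encoding.Convert(EncUTF8, Encw1252, Encoding.Default.GetBytes(Str)))
    Clipboard.SetText(Str)
    MessageBox.Show(Str)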

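In the Str.Replace() call above, the source string ends in an ordinary space (byte 0x20), while the target ends in a no-break space (byte 0xA0, decimal 160). The UTF-8 encoding of "à" is C3 A0, and 0xA0 is the no-break space in Windows-1252; it had apparently been turned into a plain space somewhere along the way, so it has to be restored before the bytes can be decoded as UTF-8.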
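
As for the first question, the document most likely is Unicode, just stored as UTF-8: U+00E9 is the code point of "é", while C3 A9 is how UTF-8 writes that code point as two bytes. The following is only a minimal sketch of that idea, not the OP's code. It assumes the .NET Framework, where code page 1252 is available without registering an extra encoding provider, and `RepairMojibake` is a hypothetical helper name. Non-ASCII characters are written with `ChrW` so the example does not depend on how the source file is saved.

```vbnet
Imports System
Imports System.Text

Module MojibakeDemo
    ' Hypothetical helper: undoes "UTF-8 bytes displayed as Windows-1252" garbling.
    Function RepairMojibake(garbled As String) As String
        ' Assumption: code page 1252 is available (true on .NET Framework).
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)
        Dim aTilde As String = ChrW(&HC3)      ' "Ã", i.e. byte C3 shown as Windows-1252
        Dim noBreakSpace As String = ChrW(&HA0) ' byte A0 in Windows-1252
        ' Put back the no-break space that was flattened to a plain space.
        Dim restored As String = garbled.Replace(aTilde & " ", aTilde & noBreakSpace)
        ' Turn the characters back into the original bytes, then decode them as UTF-8.
        Return Encoding.UTF8.GetString(cp1252.GetBytes(restored))
    End Function

    Sub Main()
        Dim eAcute As String = ChrW(&HE9) ' "é", code point U+00E9
        Dim aGrave As String = ChrW(&HE0) ' "à", code point U+00E0
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(eAcute))) ' C3-A9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(aGrave))) ' C3-A0

        ' "journÃ©e, au-delÃ ." as the legacy application would show it
        Dim garbled As String = "journ" & ChrW(&HC3) & ChrW(&HA9) & "e, au-del" & ChrW(&HC3) & " ."
        Dim repaired As String = RepairMojibake(garbled)
        ' Compare against the intended text rather than relying on console rendering.
        Console.WriteLine(repaired = "journ" & eAcute & "e, au-del" & aGrave & ".") ' True
    End Sub
End Module
```

Decoding the recovered bytes directly as UTF-8 should give the same string as the `Encoding.Convert` followed by `Encw1252.GetString` round trip above, as long as every character in the text exists in Windows-1252. The sketch uses an explicit code page instead of `Encoding.Default`, so it does not depend on the machine's ANSI setting.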
