26

I converted a Word document (.docx) to HTML, and the converted HTML uses windows-1252 as its character encoding. In .NET, all the special characters in this 1252-encoded HTML are displayed as '�'. The HTML is shown in a RadEditor, which displays it correctly only if the HTML is in UTF-8.

I tried the following code, but in vain:

Encoding wind1252 = Encoding.GetEncoding(1252);  
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);  
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);  
char[] utf8Chars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];   
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);  
string utf8String = new string(utf8Chars);

Any suggestions on how to convert the HTML to UTF-8?

Oded
Varun0554
  • 3
    Depending on the type of project you have (e.g. .NET Core), you might also need to first install the NuGet package `System.Text.Encoding.CodePages` and do an initialisation in the class constructor, with `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);` – FranzHuber23 Jan 11 '19 at 07:46
  • See https://gist.github.com/SeppPenner/ae65fccdd81bce23cd8818ffe22589c1 for an example. – FranzHuber23 Jan 11 '19 at 08:03
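
The registration step mentioned in the comment above can be sketched roughly as follows. This is a minimal illustration, assuming a .NET Core or .NET 5+ project where `CodePagesEncodingProvider` is available (on some targets via the `System.Text.Encoding.CodePages` package); the class and method names `CodePageSetup` and `GetWindows1252` are hypothetical.

```csharp
using System;
using System.Text;

public static class CodePageSetup
{
    // Hypothetical helper: registers the code page provider, then returns
    // the windows-1252 encoding. Registering the same provider more than
    // once is harmless, so this can be called repeatedly.
    public static Encoding GetWindows1252()
    {
        // Without this call, Encoding.GetEncoding(1252) throws on .NET Core
        // because legacy code pages are not registered by default.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

On the classic .NET Framework, `Encoding.GetEncoding(1252)` normally works without any registration, so this step should only matter on the newer runtimes.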

4 Answers

25

This should do it:

Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);
scottrudy
13

Actually, the problem lies here:

byte[] wind1252Bytes = wind1252.GetBytes(strHtml); 

We should not get the bytes from the HTML string. I tried the code below and it worked:

Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = ReadFile(Server.MapPath(HtmlFile));
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);


public static byte[] ReadFile(string filePath)
{
    // Read the raw bytes; do not decode them into a string first.
    // (File.ReadAllBytes(filePath) does the same job in one call.)
    using (FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
    {
        int length = (int)fileStream.Length;  // file length in bytes
        byte[] buffer = new byte[length];     // destination buffer
        int count;                            // bytes returned by the last Read call
        int sum = 0;                          // total number of bytes read so far

        // Read until Read returns 0 (end of the stream has been reached).
        while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
            sum += count;  // sum is the buffer offset for the next read

        return buffer;
    }
}
Jon Egerton
Varun0554
  • 1
    OK, I don't think I'm following. You are saying don't call GetBytes on the .NET string, but rather binary-read it directly from the filesystem. Why does this work? Because the .NET string is internally UTF-16? – mlhDev Mar 06 '12 at 15:09
  • 2
    I think he means that the system locale affects the bytes, so the string never encodes correctly; instead, you need to read the original source to get the real bytes and then convert. – Tommix May 21 '15 at 12:53
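
A likely explanation for why reading the raw bytes works: a .NET string holds already-decoded text (UTF-16 internally), so if the file was decoded with the wrong encoding on the way in, the windows-1252 bytes had already been turned into '�' (U+FFFD) before `GetBytes` was ever called. The sketch below decodes the original bytes with windows-1252 and only then re-encodes them as UTF-8. It is a minimal illustration, not the answerer's exact code: the names `Windows1252Html`, `Decode`, `ToUtf8Bytes` and `LoadHtml` are hypothetical, and the `RegisterProvider` call assumes .NET Core or .NET 5+.

```csharp
using System;
using System.IO;
using System.Text;

public static class Windows1252Html
{
    // Decode raw windows-1252 bytes into a .NET string.
    public static string Decode(byte[] rawBytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // must be registered first. On .NET Framework this line is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(rawBytes);
    }

    // Transcode raw windows-1252 bytes into UTF-8 bytes.
    public static byte[] ToUtf8Bytes(byte[] rawBytes)
    {
        return Encoding.UTF8.GetBytes(Decode(rawBytes));
    }

    // Load a windows-1252 HTML file from disk as a correctly decoded string.
    public static string LoadHtml(string path)
    {
        return Decode(File.ReadAllBytes(path));
    }
}
```

For comparison, decoding the same bytes as UTF-8 (for example `Encoding.UTF8.GetString(new byte[] { 0x93 })`) yields the replacement character, which would match the '�' symptom described in the question.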
0

How are you planning to use the resulting HTML? In my opinion, the most appropriate way to solve your problem would be to add a meta tag with an encoding specification. Something like:

<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
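
One caveat is that the declared charset has to match the bytes that are actually served. If the converted HTML still carries a windows-1252 declaration after being transcoded to UTF-8, the declaration could be rewritten along these lines. This is only a rough sketch: the `CharsetDeclaration` class and `DeclareUtf8` method are hypothetical, and a regular expression is a simplification that will not cover every way a charset can be declared.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharsetDeclaration
{
    // Replace a windows-1252 charset declaration with utf-8, keeping any
    // opening quote that precedes the charset name.
    public static string DeclareUtf8(string html)
    {
        return Regex.Replace(
            html,
            @"(charset\s*=\s*[""']?)windows-1252",
            "${1}utf-8",
            RegexOptions.IgnoreCase);
    }
}
```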
Anton Semenov
-1

Use the Encoding.Convert method. Details are in the MSDN article on Encoding.Convert.

Eugene Cheverda