Convert a string's character encoding from windows-1252 to utf-8

Question

I converted a Word document (docx) to HTML, and the converted HTML has windows-1252 as its character encoding. In .NET, with this 1252 encoding, all the special characters are displayed as '�'. The HTML is shown in a Rad Editor, which displays it correctly only if the HTML is in UTF-8.

I tried the following code, but in vain:

Encoding wind1252 = Encoding.GetEncoding(1252);  
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);  
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);  
char[] utf8Chars = new char[utf8.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];   
utf8.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);  
string utf8String = new string(utf8Chars);

Any suggestions on how to convert the html into UTF-8?

Depending on the type of project you have (e.g. .NET Core), you might also need to first install the NuGet package `System.Text.Encoding.CodePages` and do an initialisation in the class constructor with `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);` — FranzHuber23, Jan 11 '19 at 07:46
See https://gist.github.com/SeppPenner/ae65fccdd81bce23cd8818ffe22589c1 for an example. — FranzHuber23, Jan 11 '19 at 08:03
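
The registration step from the comment above could look roughly like the sketch below. The class and method names (`CodePageSetup`, `GetWindows1252`) are hypothetical and only serve the illustration; the sketch assumes a project where `CodePagesEncodingProvider` is available (built in on recent .NET versions, otherwise via the `System.Text.Encoding.CodePages` package).

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original question.
public static class CodePageSetup
{
    // Registers the code-page provider, then looks up windows-1252.
    // On .NET Core, without the registration, Encoding.GetEncoding(1252)
    // typically throws because code page 1252 is not available by default.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

Calling `CodePageSetup.GetWindows1252()` once at startup (or in a static constructor) should be enough; registering the same provider more than once is harmless.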

score 25 · Answer 1 · answered Apr 06 '11 at 14:52

This should do it:

Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;  
byte[] wind1252Bytes = wind1252.GetBytes(strHtml);
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);

score 13 · Accepted Answer · edited Mar 29 '12 at 21:30

Actually, the problem lies here:

byte[] wind1252Bytes = wind1252.GetBytes(strHtml);

We should not get the bytes from the HTML string. I tried the code below, which reads the bytes directly from the file, and it worked.

Encoding wind1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
byte[] wind1252Bytes = ReadFile(Server.MapPath(HtmlFile));
byte[] utf8Bytes = Encoding.Convert(wind1252, utf8, wind1252Bytes);
string utf8String = Encoding.UTF8.GetString(utf8Bytes);


public static byte[] ReadFile(string filePath)
{
    byte[] buffer;
    FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
    try
    {
        int length = (int)fileStream.Length;  // get file length
        buffer = new byte[length];            // create buffer
        int count;                            // actual number of bytes read
        int sum = 0;                          // total number of bytes read

        // read until Read method returns 0 (end of the stream has been reached)
        while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
            sum += count;  // sum is a buffer offset for next reading
    }
    finally
    {
        fileStream.Close();
    }
    return buffer;
}

OK; I don't think I'm catching it. You are saying don't GetBytes from the .NET string, but rather binary-read it directly from the filesystem. Why does this work? Because the .NET string is internally UTF-16? — mlhDev, Mar 06 '12 at 15:09
I think he means that the system locale affects the bytes, so the string never encodes correctly; instead you need to read the real source to get the real bytes and then convert. — Tommix, May 21 '15 at 12:53
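
One way to read the accepted answer and the comments above: a .NET string has no windows-1252 or UTF-8 "encoding" of its own, so the encoding only matters at the moment raw bytes are turned into a string. The sketch below is an interpretation under that assumption, not the answerer's code; the class `Windows1252Demo` and its method names are hypothetical. It decodes the same bytes once as windows-1252 and once (wrongly) as UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original answer.
public static class Windows1252Demo
{
    // Correct path: interpret the raw file bytes as windows-1252.
    public static string Decode(byte[] rawBytes)
    {
        // Needed on .NET Core / .NET 5+; assumed available in this project.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(rawBytes);
    }

    // Wrong path: interpret windows-1252 bytes as UTF-8.
    // Bytes such as 0x93 are invalid UTF-8 and become U+FFFD.
    public static string DecodeAsUtf8(byte[] rawBytes)
    {
        return Encoding.UTF8.GetString(rawBytes);
    }

    // Only needed when writing the text back out as UTF-8 bytes.
    public static byte[] ToUtf8Bytes(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }
}
```

For the bytes `0x93 0x48 0x69 0x94`, `Decode` yields `“Hi”`, while `DecodeAsUtf8` yields `Hi` surrounded by two U+FFFD replacement characters. Once the string contains U+FFFD, calling `GetBytes` on it cannot bring the original quotes back, which would explain why converting `strHtml` after the fact did not help.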

score 0 · Answer 3 · Anton Semenov

How are you planning to use the resulting HTML? In my opinion, the most appropriate way to solve your problem would be to add a meta tag specifying the encoding. Something like:

<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
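
If the HTML text is re-saved as UTF-8, the declaration inside it should say UTF-8 as well. The sketch below shows one possible way to rewrite an existing declaration. It assumes the converted file contains a `charset=windows-1252` declaration in its meta tag, which is not confirmed by the question; the class `CharsetDeclaration` and method `DeclareUtf8` are hypothetical names.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CharsetDeclaration
{
    // Replaces an unquoted "charset=windows-1252" value (any casing, optional
    // spaces around '=') with "charset=UTF-8". Other declaration styles,
    // such as <meta charset="windows-1252">, are left untouched.
    public static string DeclareUtf8(string html)
    {
        return Regex.Replace(
            html,
            @"charset\s*=\s*windows-1252",
            "charset=UTF-8",
            RegexOptions.IgnoreCase);
    }
}
```

This changes only the declaration text; the bytes written to disk or sent to the browser still have to be produced with a UTF-8 encoder for the declaration to be true.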

edited Apr 06 '11 at 15:09

answered Apr 06 '11 at 15:04

score -1 · Answer 4 · answered Apr 06 '11 at 14:44

Use the Encoding.Convert method. Details are in the MSDN article for Encoding.Convert.

Eugene Cheverda

Thanks for the answer, but I had tried that and don't know why it's not working for me. – Varun0554 Apr 06 '11 at 14:47
