6

I've tried googling around but wasn't able to find out which charset the text below belongs to:

具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®

But after putting <meta http-equiv="Content-Type" Content="text/html; charset=utf-8"> and that string into an HTML file, I was able to view the Chinese characters properly:

具有靜電產生裝置之影像輸入裝置 

So my question is:

  1. What tools can I use to detect the character set of this text?

  2. And how do I convert/encode/decode them properly in C#?

Update: For completeness' sake, I've added this test.

    [TestMethod]
    public void TestMethod1()
    {
        string encodedText = "具有éœé›»ç”¢ç”Ÿè£ç½®ä¹‹å½±åƒè¼¸å…¥è£ç½®";
        Encoding utf8 = new UTF8Encoding();
        Encoding window1252 = Encoding.GetEncoding("Windows-1252");

        byte[] postBytes = window1252.GetBytes(encodedText);
        
        string decodedText = utf8.GetString(postBytes);
        string actualText = "具有靜電產生裝置之影像輸入裝置";
        Assert.AreEqual(actualText, decodedText);
    }
smci
melaos
  • Possible duplicate: http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file. – lesderid Jun 10 '12 at 10:05
  • If you are only given a stream of bytes, you cannot *detect* whether it represents text in some encoding. You have to be *told* by whoever gave you the bytes. Check the documentation, manuals and protocol specifications of your data sources. – Kerrek SB Jun 10 '12 at 14:42
  • You should take a look at this great article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) http://www.joelonsoftware.com/articles/Unicode.html – Dusan Jun 10 '12 at 10:27
  • I know that, and I have reread the article, but if that's the case, why does the browser show the characters properly when the charset is set to UTF-8? What basic understanding am I missing here? – melaos Jun 10 '12 at 15:01

5 Answers

9

When you save the "bad" string in a text file with a meta tag declaring UTF-8, your text editor saves the file in Windows-1252 encoding, but the browser reads the file and interprets it as UTF-8. The "bad" string is what you get when UTF-8 bytes are incorrectly decoded as Windows-1252, so encoding the file as Windows-1252 and then decoding it as UTF-8 reverses the process.

Here's an example:

using System.Text;
using System.Windows.Forms;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "具有靜電產生裝置之影像輸入裝置"; // Unicode
            Encoding Windows1252 = Encoding.GetEncoding("Windows-1252");
            Encoding Utf8 = Encoding.UTF8;
            byte[] utf8Bytes = Utf8.GetBytes(s); // Unicode -> UTF-8
            string badDecode = Windows1252.GetString(utf8Bytes); // Mis-decode as Windows-1252
            MessageBox.Show(badDecode,"Mis-decoded");  // Shows your garbage string.
            string goodDecode = Utf8.GetString(utf8Bytes); // Correctly decode as UTF-8
            MessageBox.Show(goodDecode, "Correctly decoded");

            // Recovering from bad decode...
            byte[] originalBytes = Windows1252.GetBytes(badDecode);
            goodDecode = Utf8.GetString(originalBytes);
            MessageBox.Show(goodDecode, "Re-decoded");
        }
    }
}

Even with correct decoding, you'll still need a font that supports the characters being displayed. If your default font doesn't support Chinese, you still might not see the correct characters.

The correct thing to do is figure out why the string you have was decoded as Windows-1252 in the first place. Sometimes, though, data in a database is stored incorrectly to begin with and you have to resort to these games to fix the problem.
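If you do have to repair stored strings this way, the round trip can be wrapped in a small helper. The following is only a minimal sketch under stated assumptions: the names `MojibakeRepair` and `TryRepair` are hypothetical, the damage is assumed to be exactly one UTF-8 to Windows-1252 mis-decode, and the target is assumed to be modern .NET (Core 3.0+ / 5+), where code page 1252 has to be registered first. On .NET Framework the `RegisterProvider` line should not be needed. Both encodings are created in strict mode so that strings which were never mis-decoded are rejected instead of being silently mangled.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    static MojibakeRepair()
    {
        // Assumption: modern .NET, where Windows-1252 is not available
        // until the code pages provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hypothetical helper: tries to undo one UTF-8 -> Windows-1252 mis-decode.
    // Returns false and hands back the input unchanged if the round trip fails.
    public static bool TryRepair(string suspect, out string repaired)
    {
        // Strict Windows-1252: throw on characters the code page cannot encode.
        Encoding windows1252 = Encoding.GetEncoding(
            "Windows-1252",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        // Strict UTF-8: throw on malformed byte sequences instead of emitting U+FFFD.
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            byte[] originalBytes = windows1252.GetBytes(suspect);
            repaired = strictUtf8.GetString(originalBytes);
            return true;
        }
        catch (EncoderFallbackException)
        {
            repaired = suspect; // contains characters outside Windows-1252
            return false;
        }
        catch (DecoderFallbackException)
        {
            repaired = suspect; // recovered bytes are not valid UTF-8
            return false;
        }
    }
}
```

Note that plain ASCII text passes through unchanged and reports success, because ASCII is byte-for-byte identical in both encodings. A string that is already correct Chinese fails at the Windows-1252 step and is returned as-is.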

Mark Tolonen
  • wow! thanks a lot, i've been meaning to understand what that garbage text is and finally your simple and clear cut explanation rocks! :) and yup i believe the initial data was inserted as garbage... have to find a way to clean that up – melaos Jun 10 '12 at 17:35
1
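using System.Text;
using System.Windows.Forms;

string test = "敭畳灴獩楫n"; // Incoming data; should read "mesutpiskin"

// Re-encode as UTF-16 (little-endian) to get back the original bytes.
byte[] bytes = Encoding.Unicode.GetBytes(test);

// Decode one byte per character (ISO-8859-1 maps each byte to the same code point,
// which is what casting every byte to char does), then drop the padding null.
string s = Encoding.GetEncoding("ISO-8859-1").GetString(bytes).Trim((char)0);

MessageBox.Show(s); // s == "mesutpiskin"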
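To see why the snippet above works, it may help to reproduce the damage in the other direction: the garbled string is what single-byte text looks like when its bytes are read two at a time as UTF-16 (little-endian). This is a small sketch of that idea, assuming pure ASCII input; the names `Utf16Mixup`, `Garble` and `Ungarble` are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf16Mixup
{
    // Simulates the mistake: ASCII bytes decoded as UTF-16LE, two bytes per character.
    public static string Garble(string ascii)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(ascii);
        if (bytes.Length % 2 != 0)
        {
            // Pad with a trailing zero byte so the length is a whole number of UTF-16 units.
            Array.Resize(ref bytes, bytes.Length + 1);
        }
        return Encoding.Unicode.GetString(bytes);
    }

    // Reverses it: recover the raw bytes, decode them as ASCII, drop the padding null.
    public static string Ungarble(string garbled)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(garbled);
        return Encoding.ASCII.GetString(bytes).TrimEnd('\0');
    }
}
```

Because every ASCII byte is below 0x80, each pair of bytes forms a code unit that can never be a surrogate, so the round trip is lossless for ASCII input.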
mesutpiskin
0

I'm not really sure what you mean, but I'm guessing you want to convert between a byte array holding text in a certain encoding and a string. Let's assume the character encoding is called "FooBar":

This is how you encode and decode:

Encoding myEncoding = Encoding.GetEncoding("FooBar");
string myString = "lala";
byte[] myEncodedBytes = myEncoding.GetBytes(myString);
string myDecodedString = myEncoding.GetString(myEncodedBytes);

You can learn more about the Encoding class over at MSDN.
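"FooBar" is only a placeholder, and `Encoding.GetEncoding` throws an `ArgumentException` for a name it does not recognise. As a rough illustration with real encoding names, the sketch below (the `RoundTrip` class and `Via` method are hypothetical) shows that encoding and then decoding with the same encoding only preserves the text when that encoding can represent every character.

```csharp
using System;
using System.Text;

public static class RoundTrip
{
    // Encodes text to bytes and decodes it again using the same named encoding.
    public static string Via(string encodingName, string text)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        byte[] encoded = encoding.GetBytes(text);
        return encoding.GetString(encoded);
    }
}
```

With "utf-8" the Chinese text survives the round trip. With "us-ascii" each character that ASCII cannot represent is replaced by a question mark, which is a different failure from the mis-decoding in the question: here the original bytes are lost and cannot be recovered.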

lesderid
  • Basically, I want to be able to get the second output string from the first input in C#. I know that Notepad and Firefox can do it if I just set the charset to UTF-8; I'm just trying to understand how I get that done in C#. Is that clear? – melaos Jun 10 '12 at 10:28
  • Where are you getting the input string? From a file, user input, ...? – lesderid Jun 10 '12 at 10:31
  • It's pulled from a table column via LINQ to Entities. – melaos Jun 10 '12 at 10:38
0

Answering your question at the end of your post:

  1. If you want to determine the text encoding at runtime, take a look at Ude: http://code.google.com/p/ude/

  2. For converting between character sets, you can use Encoding.Convert: http://msdn.microsoft.com/en-us/library/system.text.encoding.convert(v=vs.100).aspx
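As a rough sketch of both points using only the built-in `System.Text` classes (not Ude, whose API is not shown here): a strict UTF-8 decode can act as a cheap check for whether bytes are at least valid UTF-8, and `Encoding.Convert` re-encodes bytes from one encoding to another. The class and method names are hypothetical. Passing the validity check only means the bytes are well-formed UTF-8, not that UTF-8 is what the sender intended.

```csharp
using System;
using System.Text;

public static class EncodingTools
{
    // Heuristic only: true if the bytes are well-formed UTF-8.
    public static bool LooksLikeUtf8(byte[] data)
    {
        try
        {
            // throwOnInvalidBytes: true makes malformed input raise an exception.
            new UTF8Encoding(false, true).GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Re-encodes UTF-8 bytes as UTF-16 (little-endian) bytes.
    public static byte[] Utf8ToUtf16(byte[] utf8Bytes)
    {
        return Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    }
}
```

`Encoding.Convert` operates on byte arrays, so it is mainly useful when you have to hand bytes to another system; to get a string, calling `GetString` on the source encoding is enough.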

eyossi
0

It's Windows Latin 1 (Windows-1252). I pasted the Chinese text as UTF-8 into BBEdit (a text editor for Mac), re-opened the file as Windows Latin 1, and bang, the exact same accented characters appeared.

dda