How to detect the character set encoding for any language?

Question

I have a .smi file. When I open it in Notepad++, I see:

<font color="#FF8040"><I>- °øµ¿¹ø¿ªÀÌ´Ùº¸´Ï Áö¸í°ú ÀÌ¸§ÀÌ ¾à°£¾¿ Æ²¸±¼ö ÀÖ½À´Ï´Ù.-</I></font>

Then I set Character sets > Korean > EUC-KR and get:

<font color="#FF8040"><I>- 공동번역이다보니 지명과 이름이 약간씩 틀릴수 있습니다.  -</I></font>

So, how can I do this in C#? When I open a file, I want the app to detect the character set and display the text in a RichTextBox. I used:

System.IO.StreamReader sr = new System.IO.StreamReader(openFile.FileName);
inputText.Text = sr.ReadToEnd();
inputText.SelectAll();
inputText.SelectionFont = new Font("Arial Unicode MS",9,FontStyle.Regular);

Result in inputText:

<font color="#FF8040"><I>- ���������̴ٺ��� ����� �̸��� �ణ�� Ʋ���� �ֽ��ϴ�.  -</I></font>

score 1 · Accepted Answer · answered Jan 08 '14 at 09:38

You need to tell your StreamReader to use the appropriate encoding when it reads the file. You can achieve this by replacing the first line with:

var krEncoding = System.Text.Encoding.GetEncoding("euc-kr");
System.IO.StreamReader sr = 
    new System.IO.StreamReader(openFile.FileName, krEncoding);

This is possible because the StreamReader constructor has an overload that accepts an encoding as an argument.

Cristian Lupascu

Can I make it automatic? Just open a file, and the app works out the right System.Text.Encoding.GetEncoding(encoding code)... – hazymnc Jan 08 '14 at 09:44
@user3172506 No, please see the link provided by Mormegil as a comment to the question. There's no such thing as *reliable encoding detection*. The best you can do is run some heuristics that would work for a limited set of scenarios. [You basically have to know the encoding beforehand](http://www.joelonsoftware.com/articles/Unicode.html). – Cristian Lupascu Jan 08 '14 at 09:47
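
As an illustration of the kind of limited heuristic mentioned above, here is a minimal sketch, not a reliable detector. It assumes the file is either valid UTF-8 or in one legacy encoding that you already know (for example EUC-KR). `EncodingGuess` and `DecodeWithFallback` are hypothetical names, not part of .NET.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: decode as strict UTF-8 when the bytes are valid UTF-8,
    // otherwise decode with the fallback encoding chosen by the caller.
    // Note: a leading UTF-8 byte order mark is not stripped here.
    public static string DecodeWithFallback(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes the decoder throw on malformed input
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: assume the caller's legacy encoding.
            return fallback.GetString(bytes);
        }
    }
}
```

It could be called as `EncodingGuess.DecodeWithFallback(File.ReadAllBytes(openFile.FileName), Encoding.GetEncoding("euc-kr"))`. On .NET Core and .NET 5+, legacy code pages such as EUC-KR are only available after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been called. The sketch cannot tell two legacy encodings apart, which is exactly the limitation described in the comment above.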

score 0 · Answer 2 · answered Jan 08 '14 at 09:53

I have not seen any *.smi files yet, so my answer may be off to some degree, but:

if the file is in raw 16-bit Unicode (UCS-2, little-endian)
- then every character is coded as 2 bytes
- open the file as binary and look at the first two bytes
- they should be FF,FE [hex]
- this is the little-endian 16-bit Unicode signature (the byte order mark)
- after that, every character is a pair of bytes: the low byte first (the ASCII code for plain ASCII characters), then the high byte, which acts like a code page
- read the high bytes to see which Unicode block, and so which script, the text uses...
- or use a full Unicode font
if the file is in UTF-8 / UTF-16
- detect the script from the encoding of the extended characters (see the Unicode documentation)
- or use a full Unicode font
the data inside *.smi can be encoded differently than the file itself
- in that case, look in the SMI documentation for codepage tags
- if it has none, then you are out of luck
- if it does, then you should use the tag for decoding ...
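
The signature check in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: `BomSniffer` and `DetectFromBom` are hypothetical names, and a file without a byte order mark (such as an EUC-KR file) yields `null`, so the encoding still has to be known some other way.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding implied by a leading
    // byte order mark, or null when the bytes do not start with one.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        // Test UTF-32 LE before UTF-16 LE: its mark FF FE 00 00 also starts with FF FE.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big-endian
        return null;                          // no byte order mark found
    }
}
```

StreamReader can perform a similar check itself: the constructor overload `StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)` uses the byte order mark when one is present and the supplied encoding otherwise.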

PS. There are not many Unicode fonts out there, and none is complete!
Of the better ones (more code pages supported), I know only of:

Quivira
unifont (be aware that this one is a bitmap font!)
But I have not done any research in this area for about a year, so the situation may have changed...
