1

I have a .smi file. When I open it in Notepad++, I see:

<font color="#FF8040"><I>- °øµ¿¹ø¿ªÀÌ´Ùº¸´Ï Áö¸í°ú À̸§ÀÌ ¾à°£¾¿ Ʋ¸±¼ö ÀÖ½À´Ï´Ù.-</I></font>

Then I set Character sets > Korean > EUC-KR and got:

<font color="#FF8040"><I>- 공동번역이다보니 지명과 이름이 약간씩 틀릴수 있습니다.  -</I></font>

So, how can I do this in C#? When I open a file, I want the app to detect the character set and display the text in a RichTextBox. I used:

System.IO.StreamReader sr = new System.IO.StreamReader(openFile.FileName);
inputText.Text = sr.ReadToEnd();
inputText.SelectAll();
inputText.SelectionFont = new Font("Arial Unicode MS",9,FontStyle.Regular);

Result in inputText:

<font color="#FF8040"><I>- ���������̴ٺ��� ����� �̸��� �ణ�� Ʋ���� �ֽ��ϴ�.  -</I></font>
Cœur
hazymnc

2 Answers

1

You need to tell your StreamReader which encoding to use when it reads the file. You can do this by replacing the first line with:

var krEncoding = System.Text.Encoding.GetEncoding("euc-kr");
System.IO.StreamReader sr = 
    new System.IO.StreamReader(openFile.FileName, krEncoding);

This is possible because the StreamReader constructor has an overload that accepts an encoding as an argument.
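
For completeness, here is a minimal sketch of the whole read wrapped in a helper. The class and method names (SmiReader, ReadAllText) are made up for illustration. It assumes .NET Core / .NET 5 or later, where legacy code pages such as EUC-KR are only available after registering CodePagesEncodingProvider; on .NET Framework that registration is not needed.

```csharp
using System.IO;
using System.Text;

static class SmiReader
{
    // Hypothetical helper: reads the whole file, decoding it with the
    // named encoding (EUC-KR unless the caller says otherwise).
    public static string ReadAllText(string path, string encodingName = "euc-kr")
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page
        // encodings must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding(encodingName);

        // 'using' closes the file even if ReadToEnd throws.
        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In the question's code this would be used as inputText.Text = SmiReader.ReadAllText(openFile.FileName);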

Cristian Lupascu
  • Can I make it automatic? I just want to open a file and have the app work out which encoding name to pass to System.Text.Encoding.GetEncoding(...). – hazymnc Jan 08 '14 at 09:44
  • @user3172506 No, please see the link provided by Mormegil as a comment to the question. There's no such thing as *reliable encoding detection*. The best you can do is run some heuristics that would work for a limited set of scenarios. [You basically have to know the encoding beforehand](http://www.joelonsoftware.com/articles/Unicode.html). – Cristian Lupascu Jan 08 '14 at 09:47
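
To illustrate what such a limited heuristic could look like, here is a rough sketch, not real detection. It tries a strict UTF-8 decode first and, if the bytes are not valid UTF-8, falls back to an encoding the caller has to choose (EUC-KR here). The names EncodingGuess and DecodeWithFallback are made up, and the code assumes .NET Core / .NET 5+ for CodePagesEncodingProvider.

```csharp
using System.Text;

static class EncodingGuess
{
    // Hypothetical helper: decodes as UTF-8 when the bytes are valid UTF-8,
    // otherwise uses the fallback encoding picked by the caller.
    // The fallback is a guess supplied from outside, it is not detected.
    public static string DecodeWithFallback(byte[] bytes, string fallbackName = "euc-kr")
    {
        try
        {
            // throwOnInvalidBytes: true makes malformed UTF-8 raise
            // DecoderFallbackException instead of producing U+FFFD.
            var strictUtf8 = new UTF8Encoding(
                encoderShouldEmitUTF8Identifier: false,
                throwOnInvalidBytes: true);
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: .NET Core / .NET 5+, so register code pages first.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(fallbackName).GetString(bytes);
        }
    }
}
```

This only separates "valid UTF-8" from "something else". It cannot tell EUC-KR apart from other legacy encodings, which is why the fallback still has to be known beforehand.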
0

I haven't seen any *.smi files yet, so my answer may be off to some degree, but:

  1. if the file is in raw 16-bit Unicode

    • then every character is coded as 2 bytes
    • look at the file as binary and check the first two bytes
    • they should be FF,FE [hex]
    • that is the signature (byte order mark) of little-endian 16-bit Unicode
    • after that, each character is a pair of bytes: low byte first, then high byte (for plain ASCII text the high byte is 00)
    • read the high bytes to see which range, and so which language, the text is in...
    • or use a full Unicode font
  2. if the file is in UTF-8 / UTF-16

    • detect the codepage from the coding of the extended (non-ASCII) characters (see the Unicode documentation)
    • or use a full Unicode font
  3. the data inside the *.smi can be coded differently than the file itself

    • in that case look in the smi documentation for codepage tags
    • if it has none then you are out of luck
    • if it does then you should use the tag for decoding ...
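
The signature check from step 1 can be sketched roughly like this in C#. The names BomSniffer and DetectFromBom are made up for illustration. It only looks at the first bytes for a byte order mark and does not try to handle UTF-32.

```csharp
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding indicated by a byte order
    // mark at the start of 'head', or null when no known mark is present.
    public static Encoding DetectFromBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;              // EF BB BF: UTF-8 signature

        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;           // FF FE: UTF-16 little-endian

        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;  // FE FF: UTF-16 big-endian

        // No mark: the encoding has to be known some other way
        // (for example EUC-KR for the file in the question).
        return null;
    }
}
```

Note that StreamReader already performs this kind of check by default when it opens a file, so a helper like this is mainly useful if you read the bytes yourself.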

PS. There are not many Unicode fonts out there and none of them is complete!
Of the better ones (more pages supported) I know only of:

  • Quivira
  • unifont (be aware this one is bitmap font !!!)
  • but I have not done any research in this area for about a year, so the situation may have changed...
Spektre