2

I am trying to add Lua scripts to a C# project using the NLua library (nlua.org). My problem is that Cyrillic characters in string values come back garbled. My C# code is:

Lua lua = new Lua();
lua.DoFile("script.lua");
Console.WriteLine(lua["var"]);

The script file code is:

var = 'кириллица'

Changing the encoding of the script file does not help. I also tried to find the correct script file encoding by brute force with this code:

foreach (EncodingInfo ei in Encoding.GetEncodings()) {
    Encoding e = ei.GetEncoding ();
    string s1 = "cyrillic кириллица";
    System.IO.File.Delete ("script.lua");
    System.IO.File.AppendAllText ("script.lua", "var = '" + s1 + "'", e);
    string s2;
    try {
        Lua lua = new Lua ();
        lua.DoFile ("script.lua");
        s2 = lua ["var"] as string;
    } catch {
        s2 = "error in lua";
    }
    Console.WriteLine ("[{0}]\t({1})", s2, e.HeaderName);
}

Here is console output:

[error in lua] (IBM037)
[cyrillic ?????????] (IBM437)
[error in lua] (IBM500)
[cyrillic ?????????] (asmo-708)
[cyrillic ?????????] (ibm850)
[cyrillic ?????????] (ibm852)
[cyrillic Æ·á·Ðз¤*] (ibm855)
[cyrillic ?????????] (ibm857)
[cyrillic ?????????] (IBM00858)
[cyrillic ?????????] (ibm860)
[cyrillic ?????????] (ibm861)
[cyrillic ?????????] (ibm861)
[cyrillic ?????????] (IBM863)
[cyrillic ?????????] (ibm864)
[cyrillic ?????????] (IBM865)
[cyrillic ª¨à¨««¨æ*] (ibm866)
[cyrillic ?????????] (ibm869)
[error in lua] (ibm870)
[cyrillic ?????????] (windows-874)
[error in lua] (ibm875)
[cyrillic
{
y
‚
y
|
|
y

p] (iso-2022-jp)
[cyrillic §Ü§Ú§â§Ú§Ý§Ý§Ú§è§Ñ] (gb2312)
[cyrillic ¬Ü¬Ú¬â¬Ú¬Ý¬Ý¬Ú¬è¬Ñ] (ks_c_5601-1987)
[cyrillic ?????????] (big5)
[error in lua] (ibm1026)
[error in lua] (ibm1047)
[error in lua] (IBM01140)
[error in lua] (IBM01141)
[error in lua] (IBM01142)
[error in lua] (IBM01143)
[error in lua] (ibm1144)
[error in lua] (ibm1145)
[error in lua] (ibm1146)
[error in lua] (ibm1147)
[error in lua] (ibm1148)
[error in lua] (ibm1149)
[error in lua] (utf-16)
[error in lua] (utf-16BE)
[cyrillic ?????????] (windows-1250)
[cyrillic êèðèëëèöà] (windows-1251)
[cyrillic ?????????] (Windows-1252)
[cyrillic ?????????] (windows-1253)
[cyrillic ?????????] (windows-1254)
[cyrillic ?????????] (windows-1255)
[cyrillic ?????????] (windows-1256)
[cyrillic ?????????] (windows-1257)
[cyrillic ?????????] (windows-1258)
[cyrillic ?????????] (macintosh)
[cyrillic ?????????] (x-mac-icelandic)
[error in lua] (utf-32)
[error in lua] (utf-32BE)
[cyrillic ?????????] (us-ascii)
[error in lua] (IBM273)
[error in lua] (IBM277)
[error in lua] (IBM278)
[error in lua] (IBM280)
[error in lua] (IBM284)
[error in lua] (IBM285)
[error in lua] (IBM290)
[error in lua] (IBM297)
[error in lua] (IBM420)
[error in lua] (IBM424)
[cyrillic ËÉÒÉÌÌÉÃÁ] (koi8-r)
[error in lua] (IBM871)
[error in lua] (IBM1025)
[cyrillic ËÉÒÉÌÌÉÃÁ] (koi8-u)
[cyrillic ?????????] (iso-8859-1)
[cyrillic ?????????] (iso-8859-2)
[cyrillic ?????????] (iso-8859-3)
[cyrillic ?????????] (iso-8859-4)
[cyrillic ÚØàØÛÛØæÐ] (iso-8859-5)
[cyrillic ?????????] (iso-8859-6)
[cyrillic ?????????] (iso-8859-7)
[cyrillic ?????????] (iso-8859-8)
[cyrillic ?????????] (iso-8859-9)
[cyrillic ?????????] (iso-8859-15)
[cyrillic ?????????] (windows-38598)
[cyrillic ?????????] (iso-2022-jp)
[cyrillic ?????????] (iso-2022-jp)
[cyrillic ?????????] (iso-2022-jp)
[cyrillic §Ü§Ú§â§Ú§Ý§Ý§Ú§è§Ñ] (euc-jp)
[cyrillic ¬Ü¬Ú¬â¬Ú¬Ý¬Ý¬Ú¬è¬Ñ] (euc-kr)
[cyrillic §Ü§Ú§â§Ú§Ý§Ý§Ú§è§Ñ] (GB18030)
[cyrillic ?????????] (x-iscii-de)
[cyrillic ?????????] (x-iscii-be)
[cyrillic ?????????] (x-iscii-ta)
[cyrillic ?????????] (x-iscii-te)
[cyrillic ?????????] (x-iscii-as)
[cyrillic ?????????] (x-iscii-or)
[cyrillic ?????????] (x-iscii-ka)
[cyrillic ?????????] (x-iscii-ma)
[cyrillic ?????????] (x-iscii-gu)
[cyrillic ?????????] (x-iscii-pa)
[error in lua] (utf-7)
[error in lua] (utf-8) 

As you can see, none of the encodings produces the correct result, so I don't know how to fix this.

Egor Skriptunoff
dmitry1204
  • `Console.WriteLine` is probably expecting 866 codepage, but your string is probably win1251. Can you invoke WinAPI `CharToOem()` just before `Console.WriteLine`? – Egor Skriptunoff Aug 21 '16 at 10:42
  • If this does not help, then please give the following information: 1) Write the following code `var = 'кириллица';print(var:byte(1,-1))` to your `script.lua` and show the output; 2) Don't change the encoding of your script file, try changing the encoding of `s2` variable instead (I don't know how to do that in C#). – Egor Skriptunoff Aug 21 '16 at 10:49
  • @EgorSkriptunoff Thanks for the reply. I don't know how to use CharToOem in C#. The console output of `var = 'кириллица';print(var:byte(1,-1))` is `208 186 208 184 209 128 208 184 208 187 208 187 208 184 209 134 208 176` – dmitry1204 Aug 21 '16 at 11:13
  • Ok, your Lua file is in UTF-8 encoding. Now, to make sure that C# receives this string properly, try to output variable `s2` into GUI window instead of console, e.g., `System.Windows.MessageBox.Show(s2);` – Egor Skriptunoff Aug 21 '16 at 11:27
  • It is not necessary. I can see incorrect symbols using debugger. – dmitry1204 Aug 21 '16 at 11:59
  • Probably, you are working on Windows having European locale settings? It is strange that your console displays symbols from win1252 codepage. Check your settings: `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP` and `...\OEMCP` – Egor Skriptunoff Aug 21 '16 at 12:21
  • @EgorSkriptunoff It is Linux. The problem is the same on Windows, but I don't have a Windows PC right now. – dmitry1204 Aug 21 '16 at 12:37
  • On Linux you can fix it by adding 1251 locale to the system (add the line `ru_RU.CP1251 CP1251` in the file `/var/lib/locales/supported.d/ru` and run `dpkg-reconfigure locales`) and then use this locale by invoking `LC_ALL=ru_RU.CP1251 ./your_program` – Egor Skriptunoff Aug 21 '16 at 12:45

2 Answers

0

I have made some progress on this. Here is my code:

// e1: the encoding used to write the script file
foreach (EncodingInfo ei1 in Encoding.GetEncodings()) {
    Encoding e1 = ei1.GetEncoding ();
    string s1 = "кириллица";
    System.IO.File.Delete ("script.lua");
    System.IO.File.AppendAllText ("script.lua", "var = '" + s1 + "'", e1);
    string s2;
    try {
        Lua lua = new Lua ();
        lua.DoFile ("script.lua");
        s2 = lua ["var"] as string;
        // e2: the encoding used to turn the returned string back into bytes
        foreach (EncodingInfo ei2 in Encoding.GetEncodings()) {
            Encoding e2 = ei2.GetEncoding ();
            byte[] bytes = e2.GetBytes (s2);
            // e3: the encoding used to decode those bytes again
            foreach (EncodingInfo ei3 in Encoding.GetEncodings()) {
                try {
                    Encoding e3 = ei3.GetEncoding ();
                    string s3 = e3.GetString (bytes);
                    // Report only the combinations that reproduce the original text
                    if (s1 == s3)
                        Console.WriteLine ("({0})=>({1})=>({2}):[{3}]", e1.HeaderName, e2.HeaderName, e3.HeaderName, s3);
                } catch { }
            }
        }
    } catch { }
}

I write the script file in every encoding, then read the value back and convert it from every encoding to every other encoding. Finally I compare the resulting text with the original. The console output lists the combinations that reproduce the original text:

(ibm855)=>(Windows-1252)=>(ibm855):[кириллица]
(ibm855)=>(iso-8859-1)=>(ibm855):[кириллица]
(ibm866)=>(Windows-1252)=>(ibm866):[кириллица]
(ibm866)=>(windows-1254)=>(ibm866):[кириллица]
(ibm866)=>(windows-1258)=>(ibm866):[кириллица]
(ibm866)=>(iso-8859-1)=>(ibm866):[кириллица]
(ibm866)=>(iso-8859-9)=>(ibm866):[кириллица]
(iso-2022-jp)=>(asmo-708)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-1)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-2)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-3)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-4)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-5)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-6)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-7)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-8)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-9)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(iso-8859-15)=>(iso-2022-jp):[кириллица]
(iso-2022-jp)=>(windows-38598)=>(iso-2022-jp):[кириллица]
(gb2312)=>(Windows-1252)=>(gb2312):[кириллица]
(gb2312)=>(Windows-1252)=>(euc-jp):[кириллица]
(gb2312)=>(Windows-1252)=>(GB18030):[кириллица]
(gb2312)=>(iso-8859-1)=>(gb2312):[кириллица]
(gb2312)=>(iso-8859-1)=>(euc-jp):[кириллица]
(gb2312)=>(iso-8859-1)=>(GB18030):[кириллица]
(gb2312)=>(iso-8859-15)=>(gb2312):[кириллица]
(gb2312)=>(iso-8859-15)=>(euc-jp):[кириллица]
(gb2312)=>(iso-8859-15)=>(GB18030):[кириллица]
(ks_c_5601-1987)=>(Windows-1252)=>(ks_c_5601-1987):[кириллица]
(ks_c_5601-1987)=>(Windows-1252)=>(euc-kr):[кириллица]
(ks_c_5601-1987)=>(iso-8859-1)=>(ks_c_5601-1987):[кириллица]
(ks_c_5601-1987)=>(iso-8859-1)=>(euc-kr):[кириллица]
(ks_c_5601-1987)=>(iso-8859-15)=>(ks_c_5601-1987):[кириллица]
(ks_c_5601-1987)=>(iso-8859-15)=>(euc-kr):[кириллица]
(windows-1251)=>(Windows-1252)=>(windows-1251):[кириллица]
(windows-1251)=>(iso-8859-1)=>(windows-1251):[кириллица]
(windows-1251)=>(iso-8859-15)=>(windows-1251):[кириллица]
(koi8-r)=>(Windows-1252)=>(koi8-r):[кириллица]
(koi8-r)=>(Windows-1252)=>(koi8-u):[кириллица]
(koi8-r)=>(windows-1254)=>(koi8-r):[кириллица]
(koi8-r)=>(windows-1254)=>(koi8-u):[кириллица]
(koi8-r)=>(iso-8859-1)=>(koi8-r):[кириллица]
(koi8-r)=>(iso-8859-1)=>(koi8-u):[кириллица]
(koi8-r)=>(iso-8859-9)=>(koi8-r):[кириллица]
(koi8-r)=>(iso-8859-9)=>(koi8-u):[кириллица]
(koi8-r)=>(iso-8859-15)=>(koi8-r):[кириллица]
(koi8-r)=>(iso-8859-15)=>(koi8-u):[кириллица]
(koi8-u)=>(Windows-1252)=>(koi8-r):[кириллица]
(koi8-u)=>(Windows-1252)=>(koi8-u):[кириллица]
(koi8-u)=>(windows-1254)=>(koi8-r):[кириллица]
(koi8-u)=>(windows-1254)=>(koi8-u):[кириллица]
(koi8-u)=>(iso-8859-1)=>(koi8-r):[кириллица]
(koi8-u)=>(iso-8859-1)=>(koi8-u):[кириллица]
(koi8-u)=>(iso-8859-9)=>(koi8-r):[кириллица]
(koi8-u)=>(iso-8859-9)=>(koi8-u):[кириллица]
(koi8-u)=>(iso-8859-15)=>(koi8-r):[кириллица]
(koi8-u)=>(iso-8859-15)=>(koi8-u):[кириллица]
(iso-8859-5)=>(Windows-1252)=>(iso-8859-5):[кириллица]
(iso-8859-5)=>(iso-8859-1)=>(iso-8859-5):[кириллица]
(iso-8859-5)=>(iso-8859-15)=>(iso-8859-5):[кириллица]
(euc-jp)=>(Windows-1252)=>(gb2312):[кириллица]
(euc-jp)=>(Windows-1252)=>(euc-jp):[кириллица]
(euc-jp)=>(Windows-1252)=>(GB18030):[кириллица]
(euc-jp)=>(iso-8859-1)=>(gb2312):[кириллица]
(euc-jp)=>(iso-8859-1)=>(euc-jp):[кириллица]
(euc-jp)=>(iso-8859-1)=>(GB18030):[кириллица]
(euc-jp)=>(iso-8859-15)=>(gb2312):[кириллица]
(euc-jp)=>(iso-8859-15)=>(euc-jp):[кириллица]
(euc-jp)=>(iso-8859-15)=>(GB18030):[кириллица]
(euc-kr)=>(Windows-1252)=>(ks_c_5601-1987):[кириллица]
(euc-kr)=>(Windows-1252)=>(euc-kr):[кириллица]
(euc-kr)=>(iso-8859-1)=>(ks_c_5601-1987):[кириллица]
(euc-kr)=>(iso-8859-1)=>(euc-kr):[кириллица]
(euc-kr)=>(iso-8859-15)=>(ks_c_5601-1987):[кириллица]
(euc-kr)=>(iso-8859-15)=>(euc-kr):[кириллица]
(GB18030)=>(Windows-1252)=>(gb2312):[кириллица]
(GB18030)=>(Windows-1252)=>(euc-jp):[кириллица]
(GB18030)=>(Windows-1252)=>(GB18030):[кириллица]
(GB18030)=>(iso-8859-1)=>(gb2312):[кириллица]
(GB18030)=>(iso-8859-1)=>(euc-jp):[кириллица]
(GB18030)=>(iso-8859-1)=>(GB18030):[кириллица]
(GB18030)=>(iso-8859-15)=>(gb2312):[кириллица]
(GB18030)=>(iso-8859-15)=>(euc-jp):[кириллица]
(GB18030)=>(iso-8859-15)=>(GB18030):[кириллица]

Here the first name is the script file's encoding. The value read from Lua is then encoded to bytes using the second encoding, and those bytes are decoded using the third. Look at this line, for example:

(windows-1251)=>(Windows-1252)=>(windows-1251):[кириллица]

It means that the script file was written in windows-1251, but to get the correct text you have to encode the returned string as windows-1252 and then decode the resulting bytes as windows-1251. I don't know whether this is an NLua issue or something else.
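The round trip described above can be sketched in isolation, without NLua. This is only an illustration under stated assumptions: it simulates the garbled value by decoding UTF-8 bytes as iso-8859-1 (one of the "second" encodings that worked in the output above), and the helper name Reinterpret is hypothetical. iso-8859-1 and UTF-8 are used here because they need no extra code page registration; the same pattern should apply to the windows-1252 / windows-1251 pair where those encodings are available.

```csharp
using System;
using System.Text;

class MojibakeRepair
{
    // Hypothetical helper: turn the garbled string back into raw bytes using
    // the encoding it was wrongly decoded with, then decode those bytes with
    // the encoding the script file was really written in.
    static string Reinterpret(string garbled, Encoding assumed, Encoding actual)
    {
        byte[] raw = assumed.GetBytes(garbled);
        return actual.GetString(raw);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "кириллица";

        // Simulate what the question observes: each byte of the file's
        // encoding ends up as one Latin-1 character in the .NET string.
        string garbled = latin1.GetString(Encoding.UTF8.GetBytes(original));

        string repaired = Reinterpret(garbled, latin1, Encoding.UTF8);
        Console.WriteLine(repaired == original); // prints True
    }
}
```

With a real script, garbled would be the value of lua["var"] as string, and actual would be whatever encoding the script file was saved in.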

dmitry1204
  • That's because C# considers `lua ["var"] as string` as if it was encoded in win1252. The simplest solution is to change your Windows locale to cyrillic (change the language of non-Unicode programs in Windows Control Panel) and use win1251 everywhere, no conversions will be needed. – Egor Skriptunoff Aug 21 '16 at 12:40
  • A string in Lua is a sequence of bytes. So, the encoding of the script file totally matters. There are always several to choose from. The choice has to be known—either explicitly or by convention—by all components that treat bytes as text. There are lots of conversions (null-conversion or actual) going on. `dofile` would be one place. `lua ["var"]` is another. And, `Console.WriteLine` is another. Ideally, each place would allow you to be explicit. This is a problem where you have to divide and conquer or just accept that you don't have control and learn each point's convention. – Tom Blodget Aug 21 '16 at 13:48
  • @TomBlodget Oh. It's a bit more complicated than I expected. But there is nothing to do. Is my solution with string conversion ok? Or is there a simpler way? – dmitry1204 Aug 21 '16 at 14:47
  • I don't have any specific suggestion except that a console-side problem is easy to solve. Given that a .NET string's encoding is UTF-16 and that Console.WriteLine converts to Console.OutputEncoding, you just have to make sure that the console's encoding matches (chcp or locale) and that the console's font can display the characters you need. For any other problem you need to look inside NLua. – Tom Blodget Aug 21 '16 at 17:02
0

I found a simple solution for my project. I use

lua.DoString (System.IO.File.ReadAllText ("script.lua", enc));

instead of

lua.DoFile ("script.lua");

Here enc is the encoding of my script file.
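A minimal sketch of the .NET side of this approach, assuming the script is saved as UTF-8 without a byte order mark. The temporary file name and the choice of encoding are illustrative assumptions, and the NLua call is left as a comment so the example runs without the library.

```csharp
using System;
using System.IO;
using System.Text;

class ReadScriptWithEncoding
{
    static void Main()
    {
        // Hypothetical path and encoding, chosen only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "script_demo.lua");
        Encoding enc = new UTF8Encoding(false); // UTF-8 without a byte order mark

        File.WriteAllText(path, "var = 'кириллица'", enc);

        // Decode the file in .NET with a known encoding, so the script text
        // is already a correct UTF-16 string before it is handed to Lua.
        string source = File.ReadAllText(path, enc);
        Console.WriteLine(source.Contains("кириллица")); // prints True

        // With NLua referenced, the text would then be run as in the answer:
        // lua.DoString(source);

        File.Delete(path);
    }
}
```

The point of the design is that the byte-to-text conversion happens once, in a place where the encoding is stated explicitly, instead of being left to whatever default the file loader applies.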

dmitry1204