C# string not supporting cyrillic chars

Question

I am going nuts over C# encoding, trying to store Cyrillic characters in a string, and so far I haven't found a solution.

For example, if I execute the following code:

string test = "АЗУОЫЯЕЁЮИ";

The test variable will contain two question marks for each character instead of the character itself.

It seems to be using ASCII for encoding. I thought all C# strings were UTF-8 by default; if it is using ASCII instead, I haven't found a way to change that, so I don't know what to do.

I am using the MonoDevelop that comes bundled with the Unity game engine, under OS X Yosemite. I DO save these files as UTF-8, and I have double-checked that with iconv, just in case MonoDevelop wasn't doing it right. They are UTF-8 without any doubt.

I have taken a look at the C# documentation about encoding, but I am afraid I haven't understood it very well, since I didn't find anything that could help me with this problem.

EDIT: I am adding this code because it shows the problem is not just a matter of what you see, but something about the internal encoding itself. (BTW, that "А" character is not an ASCII "A" but a Russian Cyrillic "А"):

            // Debug code: "А" is Cyrillic U+0410 (not ASCII "A"), "З" is Cyrillic U+0417
            string one = "А";
            string two = "А";
            string three = "З";
            // Expected: one equals two, but one differs from three
            string logMessageOne = (one == two) ? "One is equal to Two" : "One is different than Two";
            string logMessageTwo = (one == three) ? "One is equal to Three" : "One is different than Three";
            string logMessageThree = (one.CompareTo(three) == 0) ? "One is equal to Three" : "One is different than Three";

In all cases it says that all strings are equal.

It matters a lot where you see those "?" characters, i.e. the output must be properly formatted and must support UTF-16 strings. You may simply be looking at the raw encoded string rather than what it would look like in a Unicode-enabled label. – CodeSmile, Jan 12 '15 at 22:39
http://stackoverflow.com/questions/5055659/c-sharp-unicode-string-output – MethodMan, Jan 12 '15 at 22:39
@LearnCocos2D Actually, it is not a matter of what I am seeing; something is wrong internally: when you compare two such characters in two different strings, it says they are equal even if they are different. – Fran Marzoa, Jan 12 '15 at 22:53
@MethodMan Thanks, I know how to use the search myself, but that doesn't solve my problem. He seems to have a problem with Windows encoding configuration, fonts, or whatever, and I am not using Windows at all, but OS X. And I am saving all my code files as UTF-8. – Fran Marzoa, Jan 12 '15 at 22:55
@RufusL This is my environment: OSX Yosemite + Unity 4.6.1.f1 + Mono Develop-Unity 4.0.1 (the one that comes bundled with Unity itself). Maybe it is another Unity-specific problem... – Fran Marzoa, Jan 13 '15 at 12:26
C# strings are Unicode (not necessarily UTF-8), and your debug code runs fine in .NET using the .NET compiler (check https://dotnetfiddle.net/CKx4Dp ). If it doesn't in Unity, then there's probably a bug. – Jcl, Jan 13 '15 at 12:40

score 2 · Answer 1 · answered Jan 13 '15 at 08:08

Every file containing non-ASCII characters needs to be encoded as UTF-8 with a BOM to work in Unity. By default, MonoDevelop does not do that (it writes plain UTF-8), at least on OS X.

On Windows, edit this file in Notepad++ or similar and change the encoding to UTF-8 with BOM. If you're on OS X, I can send you a tool for that.

Once you add the BOM, it usually stays there, so there is no need to repeat this on every save.
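
A quick way to check whether a file already has the BOM is to look at its first three bytes. The following is only a minimal sketch, assuming a POSIX-like shell (OS X Terminal, Linux) with `head`, `od` and `tr` available; the function name `has_bom` is made up for illustration:

```shell
# Succeeds (exit status 0) when the file starts with the UTF-8 BOM (EF BB BF).
has_bom() {
    # Dump the first three bytes as hex and strip the whitespace od adds.
    first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$first" = "efbbbf" ]
}

# Hypothetical usage:
# has_bom KeyboardRussian.cs && echo "BOM present" || echo "BOM missing"
```

A file with fewer than three bytes simply fails the comparison, so it is reported as having no BOM.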

Krzysztof Bociurko

I am using OS X and I do take care to encode all my code files as UTF-8. Moreover, to rule out that problem I have used iconv to check the file encoding, and I have even "reconverted" it. So the code file encoding is definitely NOT the problem. I have added this information to the question to avoid further confusion. Thanks. – Fran Marzoa Jan 13 '15 at 12:29
UTF-8 is not the same as [UTF-8+BOM](http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8). Unity requires UTF-8+BOM, while almost everything else uses UTF-8 without BOM, including new files in MonoDevelop. The encoding is most likely the problem, and adding a BOM with iconv is not very straightforward. Try [this tool](https://gist.github.com/chanibal/397c39d59ede8682ce13); it modifies one file in place to add the BOM. – Krzysztof Bociurko Jan 13 '15 at 12:47
I missed the BOM part of your comment before, and only came across it again while searching around. So you are right. Anyway, I have solved it without needing any external tool. – Fran Marzoa Jan 13 '15 at 12:52

Fran Marzoa · Answer 2 · 2015-01-13T14:03:47.387

OK, I finally managed to figure out the problem and solve it. It is clearly yet another bug in the Unity editor: it does not just want UTF-8 files, they MUST have the BOM, even though those bytes are optional according to the UTF-8 specification. To make things worse, the MonoDevelop environment distributed with the Unity game engine does NOT save UTF-8 with the BOM, so I finally ended up adding it manually just to try, and it worked.

Just three steps on the OS X command line:

cp KeyboardRussian.cs aux
echo -ne '\xEF\xBB\xBF' > KeyboardRussian.cs
cat aux >> KeyboardRussian.cs

And it worked like a charm.
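
Note that the three commands above prepend the BOM unconditionally, so running them twice on the same file would leave two BOMs. A slightly more defensive variant could look like the sketch below. It assumes a POSIX-like shell with `mktemp` available, the function name `add_bom` is hypothetical, and `printf` with octal escapes stands in for `echo -ne` because `echo` options differ between shells:

```shell
# Prepend the UTF-8 BOM (EF BB BF) to a file unless it is already there.
add_bom() {
    file=$1
    # Read the first three bytes as hex to detect an existing BOM.
    first=$(head -c 3 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "efbbbf" ]; then
        return 0    # already has a BOM, nothing to do
    fi
    tmp=$(mktemp) || return 1
    # Write the BOM (octal 357 273 277) followed by the original content,
    # then copy the result back so the file keeps its permissions.
    { printf '\357\273\277'; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Hypothetical usage:
# add_bom KeyboardRussian.cs
```

Calling it a second time on the same file leaves the file unchanged.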

To give credit where it is due, ChanibaL mentioned the BOM in his answer, though I didn't notice it at first.

In any case, with this solution you don't need any additional tool on OS X, and on Windows you probably just need to make minor changes:

copy KeyboardRussian.cs aux
echo -ne '\xEF\xBB\xBF' > KeyboardRussian.cs
type aux >> KeyboardRussian.cs

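Be aware that I haven't tested this on Windows, so treat it as a rough outline: cmd.exe's echo does not understand -ne or \x escapes, and aux is a reserved device name there, so you would need a different temporary file name and another way to write the three BOM bytes.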

score 0 · Answer 3 · answered Jan 12 '15 at 22:59

Maybe you can use a dictionary to transliterate the text, and then compare the strings:

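        // Requires: using System; using System.Collections.Generic; using System.Linq;
        // Maps each Cyrillic letter to a Latin transliteration.
        var map = new Dictionary<char, string>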
            {
                {'а', "a"},
                {'б', "b"},
                {'в', "v"},
                {'г', "g"},
                {'д', "d"},
                {'е', "e"},
                {'ё', "yo"},
                {'ж', "zh"},
                {'з', "z"},
                {'и', "i"},
                {'й', "j"},
                {'к', "k"},
                {'л', "l"},
                {'м', "m"},
                {'н', "n"},
                {'о', "o"},
                {'п', "p"},
                {'р', "r"},
                {'с', "s"},
                {'т', "t"},
                {'у', "u"},
                {'ф', "f"},
                {'х', "h"},
                {'ц', "c"},
                {'ч', "ch"},
                {'ш', "sh"},
                {'щ', "sch"},
                {'ъ', "j"},
                {'ы', "i"},
                {'ь', "j"},
                {'э', "e"},
                {'ю', "yu"},
                {'я', "ya"},
                {'А', "A"},
                {'Б', "B"},
                {'В', "V"},
                {'Г', "G"},
                {'Д', "D"},
                {'Е', "E"},
                {'Ё', "Yo"},
                {'Ж', "Zh"},
                {'З', "Z"},
                {'И', "I"},
                {'Й', "J"},
                {'К', "K"},
                {'Л', "L"},
                {'М', "M"},
                {'Н', "N"},
                {'О', "O"},
                {'П', "P"},
                {'Р', "R"},
                {'С', "S"},
                {'Т', "T"},
                {'У', "U"},
                {'Ф', "F"},
                {'Х', "H"},
                {'Ц', "C"},
                {'Ч', "Ch"},
                {'Ш', "Sh"},
                {'Щ', "Sch"},
                {'Ъ', "J"},
                {'Ы', "I"},
                {'Ь', "J"},
                {'Э', "E"},
                {'Ю', "Yu"},
                {'Я', "Ya"}
            };
        // Note: map[c] throws KeyNotFoundException for any character missing from the map.
        var latinText = string.Concat("АЗУОЫЯЕЁЮИ".Select(c => map[c]));
        Console.WriteLine(latinText);

Hope this helps.

Thanks, I'll give it a try, though I am quite skeptical about this working: if it has problems encoding strings, it will probably also have problems with those char keys. But who knows... – Fran Marzoa, Jan 12 '15 at 23:00
