Convert Cyrillic to Latin - Latin intruders/exceptions

Question

I am using a simple dictionary to replace Cyrillic letters with Latin ones. Most of the time it works fine, but I run into problems when the input also contains Latin letters, which are usually company names.

A few examples:

PROCRED is being converted as RROSRED

ОВЕХ as OVEH

CITY as SITU

What can I do about this?

This is the dictionary I am using:

public string ConvertCyrillicToLatin(string text)
        {
            Dictionary<string, string> words = new Dictionary<string, string>();

            words.Add("А", "A");
            words.Add("Б", "B");
            words.Add("В", "V");
            words.Add("Г", "G");
            words.Add("Д", "D");
            words.Add("Ђ", "Đ");
            words.Add("Е", "E");
            words.Add("Ж", "Ž");
            words.Add("З", "Z");
            words.Add("И", "I");
            words.Add("Ј", "J");
            words.Add("К", "K");
            words.Add("Л", "L");
            words.Add("Љ", "Lj");
            words.Add("М", "M");
            words.Add("Н", "N");
            words.Add("Њ", "Nj");
            words.Add("О", "O");
            words.Add("П", "P");
            words.Add("Р", "R");
            words.Add("С", "S");
            words.Add("Т", "T");
            words.Add("Ћ", "Ć");
            words.Add("У", "U");
            words.Add("Ф", "F");
            words.Add("Х", "H");
            words.Add("Ц", "C");
            words.Add("Ч", "Č");
            words.Add("Џ", "Dž");
            words.Add("Ш", "Š");
            words.Add("а", "a");
            words.Add("б", "b");
            words.Add("в", "v");
            words.Add("г", "g");
            words.Add("д", "d");
            words.Add("ђ", "đ");
            words.Add("е", "e");
            words.Add("ж", "ž");
            words.Add("з", "z");
            words.Add("и", "i");
            words.Add("ј", "j");
            words.Add("к", "k");
            words.Add("л", "l");
            words.Add("љ", "lj");
            words.Add("м", "m");
            words.Add("н", "n");
            words.Add("њ", "nj");
            words.Add("о", "o");
            words.Add("п", "p");
            words.Add("р", "r");
            words.Add("с", "s");
            words.Add("т", "t");
            words.Add("ћ", "ć");
            words.Add("у", "u");
            words.Add("ф", "f");
            words.Add("х", "h");
            words.Add("ц", "c");
            words.Add("ч", "č");
            words.Add("џ", "dž");
            words.Add("ш", "š");

            var source = text;
            foreach (KeyValuePair<string, string> pair in words)
            {
                source = source.Replace(pair.Key, pair.Value);
            }

            return source;
        }

UPDATE 1

As requested in the comment, here is my exemption list:

"СIТУ":"CITY",
"OBEX":"OBEX"

For now it contains just these two test entries, but it is impossible to maintain a truly functional exemption list with so many possibilities.

I expect the application to ignore any Latin letter it comes across and leave it as it is. It already works like that for Latin letters that do not exist in Cyrillic, or that exist but have the same meaning, like the letters AEODGTEJKLMN... I am having issues with letters that look the same in the Latin and Cyrillic alphabets but have different meanings, such as С(S), Х(H), У(Y), P(R)...

UPDATE 2

Here are a few examples of input, as asked in the comments. The slash signs of course do not exist in the input; I added them only so that you can distinguish the Latin part.

...ПОВЕРИОЦ /LЕNS OBEX DОО/, У СКЛАДУ СА ОДРЕДБОМ...

...ИЗЈАВА ПРИВРЕДНОГ ДРУШТВА /GRАDЈЕVINSКО РRЕDUZЕСЕ IМРЕХ LОZNIСА/ СА АДРЕСОМ...

...ЗА УГОВОР О ОТВАРАЊУ КРЕДИТНЕ ЛИНИЈЕ СА КОМПАНИЈОМ /"DOWN CITУ"/ И РАСПОН МЕСЕЧНЕ КАМАТНЕ СТОПЕ...

...КОРИСТ ПОВЕРИОЦА /ATР BANK TOUR/, СА СЕДИШТЕМ...

Not much, I'm afraid, at least not in a consistent way. OBEX converted from Cyrillic to Latin is indeed OVEH. Same for CITY/SITU. PROCRED is not a valid Cyrillic due to R, so it can't be transliterated. You could have an exemption list but that doesn't scale. — Zdeslav Vojkovic, Feb 28 '22 at 18:25
I do have an exemption list, but as you mentioned, it's not a proper solution... — sosNiLa, Mar 01 '22 at 07:32
Do you think it is possible to solve this problem by using Unicode/ANSI? — sosNiLa, Mar 01 '22 at 07:39
I tried to use `localization` to solve this issue, but it makes the code more complex and less efficient... — Xinran Shen, Mar 02 '22 at 08:49
What is the expected output if the string contains any Latin letter or any letter present in the exemption list? Can you also add the exemption list to your question? — Prasad Telkikar, Mar 03 '22 at 08:45
@sosNiLa, did you check [How to transliterate Cyrillic to Latin text](https://stackoverflow.com/q/1841874/6299857) — Prasad Telkikar, Mar 03 '22 at 16:43
@PrasadTelkikar, yes I did, my solution is from that question and answers... — sosNiLa, Mar 04 '22 at 08:18
Can you give the exact test cases you use? The above examples don't use cyrillic letters as input. — PMF, Mar 05 '22 at 10:41
@PMF, I have added the few examples in UPDATE 2 part. Thanks — sosNiLa, Mar 05 '22 at 21:57

Jackdaw · Accepted Answer · 2022-03-08T02:47:24.777

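The code below uses two dictionaries to convert text containing Cyrillic characters to Latin. The text is split into words; if a word contains at least one Latin character, the first dictionary, LatinType, is used, which maps each Cyrillic lookalike to the Latin letter it resembles. Otherwise the second dictionary, CyrillicType, is used to transliterate the word.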

using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var text = "...ПОВЕРИОЦ / LЕNS OBEX DОО/, У СКЛАДУ СА ОДРЕДБОМ..."
              + "...ИЗЈАВА ПРИВРЕДНОГ ДРУШТВА / GRАDЈЕVINSКО РRЕDUZЕСЕ IМРЕХ LОZNIСА / СА АДРЕСОМ..."
              + "...ЗА УГОВОР О ОТВАРАЊУ КРЕДИТНЕ ЛИНИЈЕ СА КОМПАНИЈОМ / DOWN CITУ / И РАСПОН МЕСЕЧНЕ КАМАТНЕ СТОПЕ...";

        var result = CyrillicToLatin.Convert(text);
    }
    public static class CyrillicToLatin
    {
        private static readonly Dictionary<string, string> ExclusionList = new()
            {
                { "ОТР COMPANY", "OTP COMPANY" }
            };

        private static readonly Dictionary<char, string> LatinType = new()
        {
            {'А', "A"},
            {'В', "B"},
            {'Е', "E"},
            {'К', "K"},
            {'М', "M"},
            {'Н', "H"},
            {'О', "O"},
            {'Р', "P"},
            {'С', "C"},
            {'Т', "T"},
            {'У', "Y"},
            {'Х', "X"}
        };

        private static readonly Dictionary<char, string> CyrillicType = new()
        {
            { 'А', "A" },
            { 'Б', "B" },
            { 'В', "V" },
            { 'Г', "G" },
            { 'Д', "D" },
            { 'Ђ', "Đ" },
            { 'Е', "E" },
            { 'Ж', "Ž" },
            { 'З', "Z" },
            { 'И', "I" },
            { 'Ј', "J" },
            { 'К', "K" },
            { 'Л', "L" },
            { 'Љ', "Lj" },
            { 'М', "M" },
            { 'Н', "N" },
            { 'Њ', "Nj" },
            { 'О', "O" },
            { 'П', "P" },
            { 'Р', "R" },
            { 'С', "S" },
            { 'Т', "T" },
            { 'Ћ', "Ć" },
            { 'У', "U" },
            { 'Ф', "F" },
            { 'Х', "H" },
            { 'Ц', "C" },
            { 'Ч', "Č" },
            { 'Џ', "Dž" },
            { 'Ш', "Š" },
            { 'а', "a" },
            { 'б', "b" },
            { 'в', "v" },
            { 'г', "g" },
            { 'д', "d" },
            { 'ђ', "đ" },
            { 'е', "e" },
            { 'ж', "ž" },
            { 'з', "z" },
            { 'и', "i" },
            { 'ј', "j" },
            { 'к', "k" },
            { 'л', "l" },
            { 'љ', "lj" },
            { 'м', "m" },
            { 'н', "n" },
            { 'њ', "nj" },
            { 'о', "o" },
            { 'п', "p" },
            { 'р', "r" },
            { 'с', "s" },
            { 'т', "t" },
            { 'ћ', "ć" },
            { 'у', "u" },
            { 'ф', "f" },
            { 'х', "h" },
            { 'ц', "c" },
            { 'ч', "č" },
            { 'џ', "dž" },
            { 'ш', "š" }
        };

        public static string Convert(string text)
        { 
            // Apply the exclusion list first               
            foreach (KeyValuePair<string, string> pair in ExclusionList)
            {
                text = text.Replace(pair.Key, pair.Value);
            }

            string pattern = @"[^,;()\s]+"; // Delimiters 

            var sb = new StringBuilder();
            var index = 0;

            foreach (Match match in Regex.Matches(text, pattern))
            {
                var dictionary = IsContainLatin(match.Value) ? LatinType : CyrillicType;
                var word = ConvertWord(match.Value, dictionary);
                if (index < match.Index)
                {
                    sb.Append(text[index..match.Index]);
                }
                sb.Append(word);
                index = match.Index + match.Length;
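            }
            // Append any trailing delimiters left after the last match
            if (index < text.Length)
            {
                sb.Append(text[index..]);
            }
            return sb.ToString();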
        }

        private static string ConvertWord(string word, Dictionary<char, string> coding)
        {
            var result = new StringBuilder();
            foreach(char c in word)
            {
                string s = c.ToString();
                if (coding.TryGetValue(c, out string val))
                    s = val;
                result.Append(s);
            }
            return result.ToString();
        }

        private static bool IsContainLatin(string s)
        {
            foreach (char c in s)
                if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                    return true;
            return false;
        }
    }
}

With this code, the text from "UPDATE 2" of the question is converted to the following:

...POVERIOC / LENS OBEX DOO/, U SKLADU SA ODREDBOM......IZJAVA PRIVREDNOG DRUŠTVA / GRADЈEVINSKO PREDUZECE IMPEX LOZNICA / SA ADRESOM......ZA UGOVOR O OTVARANjU KREDITNE LINIJE SA KOMPANIJOM / DOWN CITY / I RASPON MESEČNE KAMATNE STOPE...
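
The per-word rule above can only recognise a Latin name if the word contains at least one genuinely Latin letter. The following is a minimal sketch, not part of the original answer, that classifies a word by script so that ambiguous input can be inspected. The class name `ScriptProbe` and its members are hypothetical, and treating U+0400..U+04FF as "Cyrillic" is a simplifying assumption.

```csharp
using System;
using System.Linq;

public static class ScriptProbe
{
    // Hypothetical helper: true for the basic Latin letters A-Z and a-z.
    public static bool IsBasicLatin(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');

    // Hypothetical helper: true for characters in the Cyrillic block
    // U+0400..U+04FF (an assumption; supplementary blocks are ignored).
    public static bool IsCyrillic(char c) => c >= '\u0400' && c <= '\u04FF';

    // Classifies a word as "latin", "cyrillic", "mixed" or "other".
    public static string Classify(string word)
    {
        bool latin = word.Any(IsBasicLatin);
        bool cyrillic = word.Any(IsCyrillic);
        if (latin && cyrillic) return "mixed";
        if (latin) return "latin";
        if (cyrillic) return "cyrillic";
        return "other";
    }
}
```

For example, `ScriptProbe.Classify("AT\u0420")` (Latin A and T followed by Cyrillic Р) returns "mixed", so such a word would go through LatinType. `ScriptProbe.Classify("\u0410\u0422\u0420")` (all three letters Cyrillic) returns "cyrillic", so a name typed entirely with Cyrillic lookalikes is indistinguishable from a Cyrillic word and would still be transliterated as ATR.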

The source is just like that and I can't do anything about it. Most of the time, the company names are in Latin and the rest is in Cyrillic. Regarding the exemption list, I am already doing it like that, but there are so many options that the exemption list would be huge, plus it would need to be updated on a daily basis, when and IF someone notices an error while converting... — sosNiLa, Mar 05 '22 at 22:02
@sosNiLa: Would it be correct to assume that the words to be recoded using the `words` list contain only Cyrillic characters? — Jackdaw, Mar 05 '22 at 23:34
Thank you Jackdaw! I can't believe it's working :) It's just very, very slow... I have also noticed one error so far: for example, "ATP BANK" is being converted to "ATR BANK". The "P" in "ATP" should remain "P" and not be converted to "R". — sosNiLa, Mar 07 '22 at 08:50
@sosNiLa: I think the performance might be improved. I'll check this issue. — Jackdaw, Mar 07 '22 at 09:00
here is the example for this issue "...КОРИСТ ПОВЕРИОЦА /ATР BANK TOUR/, СА СЕДИШТЕМ..." — sosNiLa, Mar 07 '22 at 09:07
@sosNiLa: It converted in my test to: **"...KORIST POVERIOCA /ATP BANK TOUR/, SA SEDIŠTEM..."**. Check that in the `LatinType` dictionary the line `{'Р', "P"},` contains the first Cyrillic and the second Latin characters. — Jackdaw, Mar 07 '22 at 09:10
@sosNiLa: Regarding the performance. Remove the `System.Diagnostics.Debug.WriteLine()` diagnostics printing. It was used for debugging purpose. And check the performance again, please. — Jackdaw, Mar 07 '22 at 09:22
I can confirm that the LatinType is OK; I have copied it again from your post. Here is the original vs. the converted text on my side: "У КОРИСТ ЗАЛОЖНОГ ПОВЕРИОЦА AТР ВАNКА TOUR, СА СЕДИШТЕМ" "U KORIST ZALOŽNOG POVERIOCA ATR BANK TOUR, SA SEDIŠTEM" — sosNiLa, Mar 07 '22 at 09:59
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242676/discussion-between-jackdaw-and-sosnila). — Jackdaw, Mar 07 '22 at 10:00
Oh, and the speed is much better now after removing the System.Diagnostics.Debug.WriteLine() line. — sosNiLa, Mar 07 '22 at 10:04
