Efficient and fast way to parse a string with different languages

Question

I have a string like the following (generated via Google Transliterate REST calls and transliterated into two languages):

" This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস "

The Google Transliterate REST call accepts only five words at a time, so I had to loop, add each result to a list, and then concatenate everything into one string. That is why each chunk (per language) holds at most five words. The sentence has seven words in total, so the first five (This world is beautiful and) come first and the remaining two (amazingly mysterious) appear later.

How do I most efficiently parse the sentence such that I get something like:

This world is beautiful and amazingly mysterious थिस वर्ल्ड इस बेऔतिफुल एंड अमज़िन्ग्ली म्य्स्तेरिऔस থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ আমাজিন্গ্লি ম্য্স্তেরীয়ুস

Since the length of the sentence and the number of target languages can both vary, maybe I could keep one list per language and concatenate them afterwards?

I also tried transliterating one word at a time. That works well, but it is too slow because it increases the number of API calls.

Can someone help me with an efficient (and dynamic) implementation of such a scenario? Thanks a bunch!

score 1 · Accepted Answer · answered Aug 31 '12 at 07:41

One list per language is the way to go.
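A minimal sketch of the one-list-per-language idea, for the case where the mixed string has already been concatenated. It relies only on position, not on recognizing any language. The names `ChunkRegrouper` and `Regroup` are made up for illustration, and the sketch assumes that every source word transliterates to exactly one whitespace-separated token, so that each language contributes the same number of tokens per chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkRegrouper
{
    // Input layout (assumed): lang0 chunk, lang1 chunk, ..., then the next
    // batch of chunks in the same language order. Returns one string per language.
    public static List<string> Regroup(string mixed, int languageCount, int chunkSize = 5)
    {
        if (languageCount <= 0 || chunkSize <= 0)
            throw new ArgumentOutOfRangeException("languageCount and chunkSize must be positive.");

        // A null separator array makes Split break on any whitespace.
        string[] tokens = mixed.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length % languageCount != 0)
            throw new ArgumentException("Token count is not a multiple of the language count.");

        int wordsPerLanguage = tokens.Length / languageCount;
        var perLanguage = Enumerable.Range(0, languageCount)
                                    .Select(_ => new List<string>())
                                    .ToList();

        int position = 0;
        for (int done = 0; done < wordsPerLanguage; done += chunkSize)
        {
            // The last batch may be shorter than chunkSize.
            int size = Math.Min(chunkSize, wordsPerLanguage - done);
            for (int lang = 0; lang < languageCount; lang++)
            {
                for (int i = 0; i < size; i++)
                    perLanguage[lang].Add(tokens[position++]);
            }
        }

        return perLanguage.Select(words => string.Join(" ", words)).ToList();
    }
}
```

Calling `ChunkRegrouper.Regroup(statement, 3)` on the string from the question would give a list where index 0 is the English sentence, index 1 the Hindi one, and index 2 the Bengali one. If the lists are filled directly inside the loop that calls the API, this regrouping step is not needed at all.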

Daniel Hilgarth


Thanks for your message. I understand that, and that's what I was thinking but can you help in writing the logic? thanks. – Dev Dreamer Aug 31 '12 at 08:22
@DevDreamer: That's your job as a developer :-) Besides, I don't know anything about your application, existing structure etc. – Daniel Hilgarth Aug 31 '12 at 08:25
Thanks for your message. I think I'll end up with a non-efficient logic, that's why seeking help. And, the task is to convert the mixed language sentence into properly clustered one. I'll try to try though. Thanks. – Dev Dreamer Aug 31 '12 at 08:29
I finally implemented that. But man, what an exercise of brain that was! – Dev Dreamer Aug 31 '12 at 14:19
@DevDreamer: I am sure you are very proud that you did it yourself, without much help, aren't you? That's what the real fun is about software development: Solving all those little problems. – Daniel Hilgarth Aug 31 '12 at 14:26
so true. Thank you for encouraging me to do it on my own. Thanks. – Dev Dreamer Aug 31 '12 at 15:18

score 0 · Answer 2 · edited May 23 '17 at 10:34

If by "different languages" you mean different character code ranges (Unicode blocks rather than ASCII), you can use this answer:

Regular expression Spanish and Arabic words
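The same idea can be sketched in C# with .NET's named Unicode blocks. This is an illustration rather than the linked answer's code: `ScriptGrouper`, `Group`, and the language labels are hypothetical names, and the sketch assumes each target language uses its own script. Two languages that share a script (for example Hindi and Marathi) cannot be separated this way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptGrouper
{
    // Label -> pattern matching one run of characters from a single block.
    // \p{IsDevanagari} and \p{IsBengali} are .NET named Unicode blocks.
    private static readonly KeyValuePair<string, Regex>[] Scripts =
    {
        new KeyValuePair<string, Regex>("English", new Regex(@"[a-zA-Z]+")),
        new KeyValuePair<string, Regex>("Hindi",   new Regex(@"\p{IsDevanagari}+")),
        new KeyValuePair<string, Regex>("Bengali", new Regex(@"\p{IsBengali}+")),
    };

    // Returns, per label, the matching words joined in their original order.
    public static Dictionary<string, string> Group(string mixed)
    {
        var result = new Dictionary<string, string>();
        foreach (var script in Scripts)
        {
            var words = new List<string>();
            foreach (Match m in script.Value.Matches(mixed))
                words.Add(m.Value);
            result[script.Key] = string.Join(" ", words);
        }
        return result;
    }
}
```

Because the words are collected in order of appearance, the chunked layout from the question comes out as one continuous sentence per script. Characters outside the listed blocks, such as zero-width joiners, would split a word and may need extra handling.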

answered Aug 31 '12 at 07:43

Taha Paksu


Thanks for your message, but that solution uses JavaScript, while I am using C#. – Dev Dreamer Aug 31 '12 at 08:31
As far as I know, they both use PCRE as their regex engine. – Taha Paksu Aug 31 '12 at 08:36

score 0 · Answer 3 · answered Aug 31 '12 at 07:45

Pay for Google Translate's API, and your length restriction goes up to 5,000 characters per request: https://developers.google.com/translate/v2/faq

Also, yes, as Daniel has said, grouping the text by language will be necessary.

Andras Zoltan


Thanks for your message, but I am using the "Transliterate" API. – Dev Dreamer Aug 31 '12 at 08:24

score 0 · Answer 4 · answered Aug 31 '12 at 11:01

I have tried to work something out; correct me if I have misinterpreted your question.

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    string statement = "This world is beautiful and थिस वर्ल्ड इस बेऔतिफुल एंड থিস বর্ল্ড ইস বিয়াউতিফুল আন্দ amazingly mysterious अमज़िन्ग्ली म्य्स्तेरिऔस আমাজিন্গ্লি ম্য্স্তেরীয়ুস ";

    // Collect every run of Latin letters, in order, as the English part.
    string result = string.Join(" ",
        Regex.Matches(statement, "[a-zA-Z]+").Cast<Match>().Select(m => m.Value));

    // Strip the Latin words by pattern (not by substring, which could also
    // hit text inside longer words), then collapse the leftover whitespace.
    string otherLangStmt = Regex.Replace(statement, "[a-zA-Z]+", string.Empty);
    otherLangStmt = Regex.Replace(otherLangStmt, @"\s+", " ").Trim();

    Console.WriteLine(result);
    Console.WriteLine(otherLangStmt);

VIRA


Thanks. That can be a first step, but what I want is a list of statements where [0] is English, [1] is Hindi, and [2] is Bengali. With your method I would retrieve the English, but the Hindi and Bengali stay as they are. Thanks for the help. – Dev Dreamer Aug 31 '12 at 11:11
How will you identify whether it's Hindi or Bengali, from the computer's point of view? – VIRA Aug 31 '12 at 11:15
I think it's not about identifying Hindi or Bengali; it depends more on the pattern in which I am getting the result, and the sentence should be parsed on that basis. Thanks. – Dev Dreamer Aug 31 '12 at 11:31
Let me rephrase my sentence a bit :). We know that it's English, Hindi, etc., but I want to know how to let the computer understand that this is a different language. – VIRA Aug 31 '12 at 11:37
That is not required. There's a pattern, and we need to grab the words placed at each particular position. The computer does not have to understand whether it's Bengali or Hindi. – Dev Dreamer Aug 31 '12 at 11:41
I'm not sure about it, so if you come across anything, let me know. :) – VIRA Aug 31 '12 at 11:44
