Detect language of text

Question

Is there any C# library which can detect the language of a particular piece of text? For example, for the input text "This is a sentence", it should detect the language as "English", and for "Esto es una sentencia" it should detect the language as "Spanish".

I understand that language detection from text is not a deterministic problem. But both Google Translate and Bing Translator have an "Auto detect" option, which best-guesses the input language. Is there something similar available publicly, preferably in C#?

Only the other day I saw one of my intranet webpages on a PC with Google Translator installed. The page just had a few words like **mean** and **stddev** and some numbers. Google Translator told me the page was in **Romanian** and asked if I wanted a translation. If it's not a **deterministic problem**, how can software do a good job? — pavium, Sep 23 '09 at 07:19
They do a good job sometimes. Of course there will be inputs for which they utterly fail, but for the more likely inputs they perform reasonably well — Vinko Vrsalovic, Sep 23 '09 at 07:22
http://stackoverflow.com/questions/1192768/return-the-language-of-a-given-string/1192802#1192802 — Magnus Johansson, Sep 23 '09 at 08:10
@pavium. Web search is a non-deterministic problem. Software does a decent job of solving that :). — Nikhil, Oct 10 '09 at 22:16
"decent job" is highly subjective... http://www.bing.com/search?q=linux and http://www.google.com/#q=linux give you truly different results - but I tend to have an opinion like yours. — ANeves, Apr 14 '10 at 13:33
See my C# implementation here: http://stackoverflow.com/questions/1192768/return-the-language-of-a-given-string/14609043#14609043 — Sasvári Tamás, Jan 30 '13 at 16:44
"Sentencia" in Spanish means 'sentence' as in prison sentence. "Oración" is what you mean — ealfonso, Nov 27 '13 at 03:36

Ivan Akcheurov · Answer 1 · 2020-09-29T14:12:04.037

33

Yes indeed, TextCat is very good for language identification. And it has a lot of implementations in different languages.

There were no ports for .NET, so I have written one: NTextCat (NuGet, Online Demo).

It is a pure .NET Standard 2.0 DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is very much appreciated! New ideas and feature requests are welcome too :)

edited Sep 29 '20 at 14:12

answered May 23 '11 at 19:04

Tried NTextCat today, and it's very easy to work with! – Niels Bosma Aug 01 '11 at 15:24
Thanks for using it! Any particular feedback is very much appreciated. Please post your feedback (if any) [on this page](http://ntextcat.codeplex.com/discussions) – Ivan Akcheurov Aug 15 '11 at 14:58
Well, it didn't recognise Latvian.. – Mr. Blond Jul 17 '15 at 08:24
And it didn't recognize Persian (Farsi). – Mohammad Sina Karvandi Nov 16 '15 at 17:43
Here you can find the NTextcat implementation in C# with full source code. https://codecanyon.net/item/language-detect/23356008?ref=intelliwins – sambit.albus Feb 26 '19 at 11:03
I don't have a file, I just want to pass in some text and get the result – Kaveh Naseri Dec 07 '21 at 18:11
@KavehNaseri, "I don't have a file": if you mean the language model file, then you can download it here: https://github.com/ivanakcheurov/ntextcat/blob/master/src/LanguageModels/Core14.profile.xml `identifier.Identify("your text to get its language identified").FirstOrDefault()` would get you the language of your text if identified. – Ivan Akcheurov Dec 13 '21 at 19:41

score 3 · Answer 2 · answered Jan 30 '13 at 16:53

Please find a C# implementation based on 3-gram analysis here:

http://idsyst.hu/development/language_detector.html

Sasvári Tamás

dreamlax · Answer 3 · 2009-09-23T13:38:19.327

Language detection is a pretty hard thing to do.

Some languages are much easier to detect than others simply due to the diacritics and digraphs/trigraphs used. For example, double-acute accents are used almost exclusively in Hungarian. The dotless i ‘ı’ is used mainly in Turkish (and in a few related languages such as Azerbaijani), t-comma (not t-cedilla) is used only in Romanian, and the eszett ‘ß’ occurs only in German.

Some digraphs, trigraphs and tetragraphs are also a good give-away. For example, you'll most likely find ‘eeuw’ and ‘ieuw’ primarily in Dutch, and ‘tsch’ and ‘dsch’ primarily in German etc.

More giveaways would include common words or common prefixes/suffixes used in a particular language. Sometimes even the punctuation that is used can help determine a language (quote-style and use, etc).
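
As a rough illustration of the giveaways listed above, here is a minimal Python sketch (Python only for brevity; the same lookup ports directly to C#). The hint tables contain just the characters and letter sequences mentioned in this answer, and the names `CHAR_HINTS`, `SEQUENCE_HINTS` and `guess_by_hints` are made up for the example. It is a heuristic, not a real detector: most text contains none of these markers.

```python
from collections import Counter

# Hypothetical hint tables built only from the examples named in this answer.
CHAR_HINTS = {
    "ő": "Hungarian",  # double-acute o
    "ű": "Hungarian",  # double-acute u
    "ı": "Turkish",    # dotless i
    "ț": "Romanian",   # t-comma (U+021B), not t-cedilla
    "ß": "German",     # eszett
}
SEQUENCE_HINTS = {
    "eeuw": "Dutch",
    "ieuw": "Dutch",
    "tsch": "German",
    "dsch": "German",
}


def guess_by_hints(text):
    """Return the language with the most hint matches, or None if nothing matched."""
    lowered = text.lower()
    votes = Counter()
    for marker, language in {**CHAR_HINTS, **SEQUENCE_HINTS}.items():
        votes[language] += lowered.count(marker)
    best, score = votes.most_common(1)[0]
    return best if score > 0 else None
```

For example, `guess_by_hints("Een nieuwe eeuw")` votes twice for Dutch, while plain English text returns `None`, which is why hints like these are best used alongside a statistical method rather than on their own.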

If such a library exists I would like to know about it, since I'm working on one myself.

You should think about a more generic n-gram classifier trained on a corpus. — Luca Martinetti, Sep 21 '10 at 12:13

score 2 · Answer 4 · edited May 23 '17 at 11:54

Here is a simple detector based on bigram statistics. The idea is to learn from a large corpus which bigrams occur most frequently in each language, then count the bigrams in a piece of text and compare them with the learned frequencies:

http://allantech.blogspot.com/2007/07/automatic-language-detection.html

This is probably good enough for many (most?) applications and doesn't require Internet access.

Of course it will perform worse than Google's or Bing's algorithm (which themselves aren't great). If you need excellent detection performance you would have to do a lot of hard work over huge amounts of data.

The other option would be to leverage Google's or Bing's APIs if your app has Internet access.
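
For reference, the bigram idea from the start of this answer can be sketched in a few lines of Python (used here for brevity; it is not the code from the linked blog post). The two training sentences in `SAMPLES` are a toy assumption, and comparing profiles with cosine similarity is my choice for the illustration; a usable detector needs a large corpus per language.

```python
import math
from collections import Counter

# Toy training text, for illustration only. Real profiles need far more data.
SAMPLES = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog with the others",
    "Spanish": "esto es una oración y el rápido zorro marrón salta sobre el perro perezoso con los otros",
}


def bigrams(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# One bigram profile per language, learned from the samples above.
PROFILES = {language: bigrams(text) for language, text in SAMPLES.items()}


def detect(text):
    """Return the language whose bigram profile is most similar to the text."""
    target = bigrams(text)
    return max(PROFILES, key=lambda language: cosine(target, PROFILES[language]))
```

With these toy profiles, `detect("this is the dog")` picks English and `detect("el perro es perezoso")` picks Spanish, but very short or mixed inputs will be unreliable.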

edited May 23 '17 at 11:54 by Community

answered Sep 23 '09 at 07:18

Vinko Vrsalovic

In fact, this approach will give quite good results. It can be improved by using n-grams instead of bi-grams. However, it will always be difficult to tell very similar languages (e.g. Polish and Czech) apart. Languages such as Greek will be very easy though... – Dirk Vollmar Sep 23 '09 at 07:23
To avoid misunderstandings, what would you call quite good in this context? – Vinko Vrsalovic Sep 23 '09 at 07:36

score 0 · Answer 5 · answered Sep 23 '09 at 07:11

You'll want a machine learning algorithm based on hidden Markov models, trained on a bunch of texts in different languages.

Then, when it is given an unidentified text, the language whose model gives the closest 'score' is the winner.
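
A minimal sketch of the scoring idea in Python, with one important simplification: it uses a plain character-level Markov chain (letter-to-letter transition counts), not a hidden Markov model. The training sentences, the add-one smoothing and the function names are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy training texts; far too small for real use.
TRAINING = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog with the others",
    "Spanish": "esto es una oración y el rápido zorro marrón salta sobre el perro perezoso con los otros",
}

# Number of distinct characters seen in training, used for add-one smoothing.
ALPHABET_SIZE = len(set("".join(TRAINING.values())))


def train(text):
    """Count how often each character is followed by each other character."""
    transitions = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        transitions[prev][nxt] += 1
    return transitions


MODELS = {language: train(text) for language, text in TRAINING.items()}


def log_score(text, transitions):
    """Sum of smoothed log-probabilities of each character given the previous one."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        row = transitions.get(prev, Counter())
        total += math.log((row[nxt] + 1) / (sum(row.values()) + ALPHABET_SIZE))
    return total


def classify(text):
    """Return the language whose model assigns the text the highest score."""
    lowered = text.lower()
    return max(MODELS, key=lambda language: log_score(lowered, MODELS[language]))
```

Each language gets its own transition table, and the winner is the model under which the text is most probable. Scores are log-probabilities, so they are negative and the one closest to zero wins.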

Arafangion

score 0 · Answer 6 · answered Apr 14 '10 at 13:24

There is a simple tool to identify text language: http://www.detectlanguage.com/

Laurynas

It detects Persian (Farsi) as Arabic sometimes so it's not very accurate but still a good effort. – Amir Hajiha Jun 26 '18 at 14:16
Hi Amir, Persian (Farsi) detection was improved recently - worth trying again. – Laurynas Jul 02 '18 at 17:57

score 0 · Answer 7 · answered Apr 14 '10 at 13:30

I've found that "textcat" is very useful for this. I've used a PHP implementation, PHP Text Cat, based on the original implementation, and found it reliable. If you have a look at the sources, you'll find it's not a terrifyingly difficult thing to implement in the language of your choice. The hard work -- the letter combinations that are relevant to a particular language -- is all in there as data.
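
To give a feel for how small such an implementation can be, here is a Python sketch of what I understand to be the technique behind textcat: each language is a ranked list of its most frequent character n-grams, and a text is assigned to the language whose ranking is least "out of place" relative to the text's own ranking. This is not taken from the textcat sources; the sample sentences, the `top=300` cutoff and all function names are assumptions, and real profiles are built from much larger corpora.

```python
from collections import Counter

# Toy training text; real profiles are shipped as data files built from large corpora.
SAMPLES = {
    "English": "this is a sentence and the quick brown fox jumps over the lazy dog with the others",
    "Spanish": "esto es una oración y el rápido zorro marrón salta sobre el perro perezoso con los otros",
}


def ranked_ngrams(text, max_n=3, top=300):
    """Map each of the most frequent 1- to max_n-grams to its frequency rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc, profile):
    """Sum of rank differences; n-grams missing from the profile get a fixed maximum penalty."""
    penalty = len(profile)
    return sum(
        abs(rank - profile[gram]) if gram in profile else penalty
        for gram, rank in doc.items()
    )


PROFILES = {language: ranked_ngrams(text) for language, text in SAMPLES.items()}


def detect(text):
    """Return the language whose profile has the smallest out-of-place distance to the text."""
    doc = ranked_ngrams(text)
    return min(PROFILES, key=lambda language: out_of_place(doc, PROFILES[language]))
```

The classifier itself is only a few lines; as the answer says, the value is in the per-language profile data.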
