
How do I detect what language a text is written in using NLTK?

The examples I've seen use nltk.detect, but when I've installed it on my mac, I cannot find this package.

asked by niklassaers (edited by John Vandenberg)
  • The `langid` and `langdetect` libraries do the trick and are super easy to use: https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb – hb20007 May 17 '18 at 12:55
  • `langdetect` is not very reliable (e.g. check https://github.com/Mimino666/langdetect/issues/51 for instance) and `langid` choked on a test Japanese string when I tested it. YMMV. In 2019, if you are not tied to NLTK, I'd recommend you take a look at `cld2`, `cld3` or `fastText` instead. – Mathieu Rey Mar 19 '19 at 13:35
  • Nicely summarized here: https://stackoverflow.com/a/48436520/2063605 – SNA Jan 17 '20 at 09:51

5 Answers


Have you come across the following code snippet?

import nltk
nltk.download('words')  # one-time download of the English word list

english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())  # `text` is your list of tokens
unusual = text_vocab.difference(english_vocab)

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active

Or the following demo file?

https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py
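The same overlap idea generalizes to a crude language guesser: score the text's words against a stopword list per language and pick the best match. A toy, self-contained sketch (the tiny hardcoded word sets below are illustrative stand-ins for `nltk.corpus.stopwords.words(lang)`):

```python
# Illustrative mini stopword lists; with NLTK you would use
# nltk.corpus.stopwords.words(language) instead.
STOPWORDS = {
    "english": {"the", "a", "is", "in", "of", "and", "to", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "spanish": {"el", "la", "es", "en", "de", "y", "que", "un"},
}

def guess_language(text):
    words = {w.lower() for w in text.split() if w.isalpha()}
    # Pick the language whose stopword set overlaps the text the most.
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(guess_language("the cat is in the garden"))      # -> english
print(guess_language("der Hund ist nicht ein Katze"))  # -> german
```

With real stopword lists this works surprisingly well on longer texts, though it says nothing about texts in languages you didn't include.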

answered by William Niu (edited by Mark Cramer)
  • PS, it still relied on nltk.detect, though. Any idea on how to install that on a Mac? – niklassaers Aug 03 '10 at 09:59
  • I don't believe detect is a native module for nltk. Here's the code: http://docs.huihoo.com/nltk/0.9.5/api/nltk.detect-pysrc.html You could probably download it and put it in your python library, which may be in: /Library/Python/2.x/site-packages/nltk... – William Niu Aug 03 '10 at 13:53
  • Check this out: http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/ – Anoop Toffy Apr 08 '16 at 05:46
  • The requested URL /p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py was not found on this server. That’s all we know. – Mona Jalal Mar 07 '17 at 00:40
  • This is such a good answer. The simplicity of checking if the words are in the vocab is an amazingly direct approach to this kind of task. Granted it doesn't give you the actual language or translate, but if you simply need to know if it's an outlier, this is brilliant. – whege Jul 15 '22 at 18:34

This library is not part of NLTK either, but it certainly helps.

$ pip install langdetect

Supported Python versions 2.6, 2.7, 3.x.

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

https://pypi.python.org/pypi/langdetect

P.S.: Don't expect this to always work correctly:

>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
answered by SVK (edited by Mona Jalal)
  • Thank you for pointing out that it doesn't always work. `detect("You made it home!")` is giving me "fr". I'm wondering if there is anything better. – Mark Cramer Oct 14 '17 at 03:43
  • Here is another fun observation: It doesn't seem to give the same answer each time. `>>> detect_langs("Hello, I'm christiane amanpour.") [it:0.8571401485770536, en:0.14285811674731527] >>> detect_langs("Hello, I'm christiane amanpour.") [it:0.8571403121803622, fr:0.14285888197332486] >>> detect_langs("Hello, I'm christiane amanpour.") [it:0.999995562246093]` – Mark Cramer Oct 14 '17 at 04:03
  • langdetect works much better for longer strings where it can sample more n-grams; for short strings of a few words, it's extremely unreliable. – J. Taylor May 04 '18 at 05:34
  • @MarkCramer The algorithm is non-deterministic. If you want the same answer each time, set the seed: `from langdetect import DetectorFactory; DetectorFactory.seed = 0` – Philip Aug 10 '18 at 07:47
  • Quick to install, easy to use. Maybe not perfect but for my usage, it worked fine. Thank you! – mtefi Oct 15 '18 at 14:18

Although this is not part of NLTK, I have had great results with another Python-based library:

https://github.com/saffsd/langid.py

This is very simple to import and includes a large number of languages in its model.

answered by burgersmoke

Super late, but you could use the textcat classifier in nltk, here. This paper discusses the algorithm.

It returns a language code in ISO 639-3, so I would use pycountry to get the full name.

For example, load the libraries

import nltk
import pycountry
from nltk.stem import SnowballStemmer

Now let's look at two phrases, and guess their language:

phrase_one = "good morning"
phrase_two = "goeie more"

tc = nltk.classify.textcat.TextCat()  # may require nltk.download('crubadan') and nltk.download('punkt') first
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)

guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)

English
Afrikaans

You could then pass them into other nltk functions, for example:

stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
walk

Disclaimer: obviously this will not always work, especially for sparse data.

Extreme example:

guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
Konkani (individual language)
answered by RK1

polyglot.detect can detect the language:

from polyglot.detect import Detector

foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
print(Detector(foreign).language)

name: Spanish     code: es       confidence:  98.0 read bytes:   865
answered by Ryan Xu