10

Does anyone have experience determining the language of a text using Python? Is there a module available for this?

I've already tried Google's language detection service (http://ajax.googleapis.com/ajax/services/language/detect), and it worked properly, but I can't rely on it long-term for large numbers of text files.

Shawn Chin
hu3b11b7

3 Answers

3

There are official Python bindings for the CLD3 neural network model, which is what Chrome uses for offline language detection.

sudo apt install -y protobuf-compiler
pip install gcld3

Like all Python code from Google that I've used, it's unpythonic and just generally sucks to use but at least it works well:

>>> import gcld3
>>> lang_identifier = gcld3.NNetLanguageIdentifier(0, 1000)
>>> lang_identifier.Find
lang_identifier.FindLanguage(           lang_identifier.FindTopNMostFreqLangs(  
>>> a = lang_identifier.FindLanguage("This is a test")
>>> a
<gcld3.pybind_ext.Result object at 0x7f606e0ec3b0>
>>> a.
a.is_reliable  a.language     a.probability  a.proportion   
>>> a.language
'en'
>>> a = lang_identifier.FindTopNMostFreqLangs("This piece of text is in English. Този текст е на Български.", 5)
>>> a
[<gcld3.pybind_ext.Result object at 0x7f606e0ec4b0>, <gcld3.pybind_ext.Result object at 0x7f606e0ec570>, <gcld3.pybind_ext.Result object at 0x7f606e0ec470>, <gcld3.pybind_ext.Result object at 0x7f606e0ec5b0>, <gcld3.pybind_ext.Result object at 0x7f606e0ec530>]
>>> [r.language for r in a]
['bg', 'en', 'und', 'und', 'und']
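
Outside the REPL, the same calls might be wrapped in a small helper. This is only a sketch: `make_identifier` and `detect_language` are names made up for illustration, not part of gcld3. The only gcld3 pieces it relies on are the ones shown in the session above (`NNetLanguageIdentifier`, `FindLanguage`, and the `language` and `is_reliable` attributes of the result). Falling back to 'und' for unreliable results is an assumption, chosen to match the code CLD3 itself printed above for undetermined text.

```python
def make_identifier(min_num_bytes=0, max_num_bytes=1000):
    """Build a CLD3 identifier with the same arguments as the session above.

    gcld3 is imported lazily so detect_language() can be exercised
    without the package installed (hypothetical helper, not a gcld3 API).
    """
    import gcld3
    return gcld3.NNetLanguageIdentifier(min_num_bytes, max_num_bytes)


def detect_language(identifier, text, fallback="und"):
    """Return the detected language code, or `fallback` if CLD3
    marks the result as unreliable (hypothetical helper)."""
    result = identifier.FindLanguage(text)
    if not result.is_reliable:
        return fallback
    return result.language
```

With gcld3 installed, `detect_language(make_identifier(), "This is a test")` would go through the same `FindLanguage` call as the session above. For a long-running job over many files, it probably makes sense to create the identifier once and reuse it.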

You can also try the unofficial Python bindings: https://github.com/bsolomon1124/pycld3

Boris Verkhovskiy
3

I've never tried this, but it appears you can do this with NLTK (Natural Language Toolkit). See this blog post for an example.

The answer to the following question might also be relevant: NLTK and language detection
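
The blog post isn't reproduced here, so the following is not its code. One approach commonly used with NLTK is to count how many words of a text appear in each language's stopword list and pick the language with the most hits. Below is a minimal sketch of that idea, assuming tiny hand-written stopword sets in place of NLTK's stopwords corpus so that it runs with the standard library only. `STOPWORDS` and `guess_language` are hypothetical names.

```python
import re

# Tiny illustrative stopword sets (an assumption for this sketch);
# NLTK's stopwords corpus would supply much larger lists.
STOPWORDS = {
    "english": {"the", "and", "is", "in", "of", "to", "this", "that", "it", "a"},
    "french": {"le", "la", "les", "et", "est", "dans", "de", "un", "une", "ce"},
    "german": {"der", "die", "das", "und", "ist", "in", "ein", "eine", "nicht", "zu"},
    "spanish": {"el", "la", "los", "y", "es", "en", "de", "un", "una", "que"},
}


def guess_language(text):
    """Return the language whose stopwords overlap most with the text,
    or None if no stopword from any list occurs."""
    words = set(re.findall(r"\w+", text.lower()))
    # Score each language by the number of distinct stopwords found.
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("This is a test of the system and it works"))
```

This kind of counting tends to be unreliable on very short texts and cannot tell apart languages that are missing from the lists, so treat it as a rough baseline rather than a replacement for a trained model.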

Community
Shawn Chin
0

There is a Language Detection API that you can use from Python as a web service. It accepts text via GET or POST and returns JSON output with scores.

Laurynas
  • 1
    This costs money and I'm like 33% sure that it's just re-selling Google Translate, judging from the fact that it also uses `iw` as the language code for Hebrew which has been officially deprecated since 1989. – Boris Verkhovskiy Mar 23 '21 at 19:46