15

I'm faced with a situation where I'm reading a string of text and I need to detect its language code (en, de, fr, es, etc.).

Is there a simple way to do this in Python?

udo
sa125
  • I have written some [code](https://gist.github.com/ritwikmishra/bd46a4772e720aa5478283acc928b68f) to detect the script. However, this will not be able to differentiate languages that share a script (like en, fr, es). – Ritwik Nov 10 '22 at 17:18
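A minimal script-only check along those lines can be sketched with the stdlib alone (an illustrative sketch, not the linked gist; `dominant_script` is a made-up name):

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the dominant script from the first word of each
    alphabetic character's Unicode name (LATIN, CYRILLIC, CJK, ...)."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            try:
                scripts[unicodedata.name(ch).split()[0]] += 1
            except ValueError:  # character has no Unicode name
                pass
    return scripts.most_common(1)[0][0] if scripts else None

print(dominant_script(u"матрёшка"))  # CYRILLIC
print(dominant_script(u"hello"))     # LATIN
```

As the comment notes, this cannot tell apart languages that share a script.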

7 Answers

13

If you need to detect language in response to a user action then you could use the Google AJAX Language API (since deprecated):

#!/usr/bin/env python
import json
import urllib, urllib2

def detect_language(text,
    userip=None,
    referrer="http://stackoverflow.com/q/4545977/4279",
    api_key=None):        

    query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
    if userip: query.update(userip=userip)
    if api_key: query.update(key=api_key)

    url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s'%(
        urllib.urlencode(query))

    request = urllib2.Request(url, None, headers=dict(Referer=referrer))
    d = json.load(urllib2.urlopen(request))

    if d['responseStatus'] != 200 or u'error' in d['responseData']:
        raise IOError(d)

    return d['responseData']['language']

print detect_language("Python - can I detect unicode string language code?")

Output

en

Google Translate API v2

Default limit: 100,000 characters/day (no more than 5,000 per request).

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2

from operator import itemgetter

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks] 

    url = 'https://www.googleapis.com/language/translate/v2'

    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t 
           for t in chunks],
        key=api_key,
        target="en"), doseq=1)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
            "http://code.google.com/apis/language/translate/terms.html")

    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
        headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))
    if u'error' in d:
        raise IOError(d)
    return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])

Now you can request language detection explicitly via the `/detect` endpoint:

def detect_language_v2(chunks, api_key):
    """
    chunks: either string or sequence of strings

    Return list of corresponding language codes
    """
    if isinstance(chunks, basestring):
        chunks = [chunks] 

    url = 'https://www.googleapis.com/language/translate/v2/detect'

    data = urllib.urlencode(dict(
        q=[t.encode('utf-8') if isinstance(t, unicode) else t
           for t in chunks],
        key=api_key), doseq=True)

    # the request length MUST be < 5000
    if len(data) > 5000:
        raise ValueError("request is too long, see "
            "http://code.google.com/apis/language/translate/terms.html")

    #NOTE: use POST to allow more than 2K characters
    request = urllib2.Request(url, data,
        headers={'X-HTTP-Method-Override': 'GET'})
    d = json.load(urllib2.urlopen(request))

    return [sorted(L, key=itemgetter('confidence'))[-1]['language']
            for L in d['data']['detections']]

Example:

print detect_language_v2(
    ["Python - can I detect unicode string language code?",
     u"матрёшка",
     u"打水"], api_key=open('api_key.txt').read().strip())

Output

[u'en', u'ru', u'zh-CN']
jfs
  • +1: Nice way of leveraging the power of some good, existing tools. – Eric O. Lebigot Dec 28 '10 at 15:00
  • 1
    @ShimonDoodkin: you could try similar services from different providers e.g., [`microsoft-translate.py`](https://gist.github.com/zed/9507298). – jfs Dec 01 '14 at 17:02
7

In my case I only need to determine two languages, so I just check the first character:

import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])


def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])
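A quick sanity check of the two helpers (redefined here so the snippet is self-contained; Python 3 assumed, where a plain `str` is Unicode):

```python
import unicodedata

def is_greek(term):
    return 'GREEK' in unicodedata.name(term.strip()[0])

def is_hebrew(term):
    return 'HEBREW' in unicodedata.name(term.strip()[0])

print(is_greek('λόγος'))    # True
print(is_hebrew(' שלום'))   # True  (leading space is stripped)
print(is_greek('hello'))    # False
```

Note that `unicodedata.name` raises ValueError for characters with no name, and the check only looks at the first character of the term.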
yekta
6

Have a look at guess-language:

Attempts to determine the natural language of a selection of Unicode (utf-8) text.

But as the name says, it guesses the language. You can't expect 100% correct results.

Edit:

guess-language is unmaintained. But there is a fork that supports Python 3: guess_language-spirit

Benjamin Wohlwend
5

Look at Natural Language Toolkit and Automatic Language Identification using Python for ideas.

I would like to know if a Bayesian filter can get the language right, but I can't write a proof of concept right now.
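As a rough proof of concept of that idea: a naive-Bayes-style classifier over character trigrams, trained on tiny hand-picked samples (everything here — function names, training text — is made up for illustration):

```python
import math
from collections import Counter

def trigrams(text):
    """Character trigrams, padded so word boundaries count too."""
    text = "  " + text.lower() + "  "
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(samples):
    """samples: {language: training text}. Returns per-language
    trigram counts for add-one-smoothed log-probabilities."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(trigrams(text))
        models[lang] = (counts, sum(counts.values()), len(counts) + 1)
    return models

def classify(models, text):
    """Pick the language whose model gives the highest log-likelihood."""
    def score(model):
        counts, total, vocab = model
        return sum(math.log((counts[t] + 1.0) / (total + vocab))
                   for t in trigrams(text))
    return max(models, key=lambda lang: score(models[lang]))

samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
models = train(samples)
print(classify(models, "the dog and the fox"))  # en
```

With realistic per-language corpora instead of toy samples, this is roughly what trigram-based detectors like guess-language do.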

Paulo Scardine
3

A useful article here suggests that an open-source library named CLD is the best bet for detecting language in Python.

The article compares speed and accuracy across 3 solutions:

  1. language-detection or its python port langdetect
  2. Tika
  3. Chromium Language Detection (CLD)

I wasted my time with langdetect; now I am switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.

Tushar Goswami
1

Try Universal Encoding Detector; it's a port of the chardet module from Firefox to Python.

ismail
  • It's a nice library, but it gives me encoding instead of locale, which I have no use for. still, thanks. – sa125 Dec 28 '10 at 13:04
  • You can just map encoding to locale. – ismail Dec 28 '10 at 13:05
  • 1
    @İsmail 'cartman' Dönmez: That is only possible if the language has its own charset. A lot of languages share the same alphabet. Which locale does ascii map to? – pafcu Dec 28 '10 at 13:46
  • @pafcu, true, but from a piece of text you can only detect the encoding, not the locale; that's system dependent. – ismail Dec 28 '10 at 13:47
  • 1
    I assume sa125 means language, not locale. – pafcu Dec 28 '10 at 13:49
  • @pafcu: ASCII was specifically designed for en_US; the "A" does stand for "American". A better example is windows-1252, which is used for English, German, Spanish, French, Italian, etc. – dan04 Dec 28 '10 at 21:11
  • @dan04: ASCII was designed for "en_US", but that does not mean that it's not used elsewhere. Just because a text is in ASCII does _not_ mean that it is written in US english. – pafcu Dec 28 '10 at 22:04
  • @dan04: Here's an example in ASCII that is not English text: `'Jeto ne anglijskij tekst'` (`detect_language_v2()` from my answer says (incorrectly) that it is *'cs'* (Czech) http://stackoverflow.com/questions/4545977/python-can-i-detect-unicode-string-language-code/4546813#4546813 ) Actually it is a transliteration of Russian (so google's guess is almost correct). – jfs Jan 03 '11 at 21:36
  • @J.F. Sebastian, nice example. – ismail Jan 03 '11 at 21:37
-1

If you only have a limited number of possible languages, you could use a dictionary for each language (possibly containing only the most common words) and then check the words in your input against each dictionary.
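A minimal sketch of this approach (the word lists here are tiny and illustrative, not real dictionaries, and `guess_language` is a made-up name):

```python
# Tiny per-language stop-word sets; a real version would use much
# larger frequency lists for each language.
STOPWORDS = {
    'en': {'the', 'and', 'is', 'of', 'to', 'in'},
    'de': {'der', 'und', 'ist', 'von', 'zu', 'nicht'},
    'fr': {'le', 'et', 'est', 'de', 'la', 'pas'},
}

def guess_language(text):
    """Return the language whose word list overlaps the input most,
    or None if no known word matches."""
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("the cat is in the garden"))  # en
print(guess_language("der hund ist nicht hier"))   # de
```

Short inputs with no common words will return None, so this only works well on sentences long enough to contain a few function words.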

pafcu