12

Are there any good, open source engines out there for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and doesn't query Google or Bing? I'd like to detect language for each page in about 15 million pages of OCR'ed text.

Not all documents are in languages that use the Latin alphabet.

Martin Thoma
niklassaers

7 Answers

8

Depending on what you're doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.

In general, letter and word frequencies would probably be the fastest evaluation, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will also be useful if you discover that the first two methods have too high an error rate.
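
To sketch the Bayesian route, here is a minimal example using NLTK's NaiveBayesClassifier. The toy sentences and the word-presence features below are made up for illustration; in practice you would train on real corpora (NLTK ships several multilingual ones, e.g. udhr).

from nltk.classify import NaiveBayesClassifier

def features(text):
    # Represent a text by the set of lowercase words it contains.
    return {word: True for word in text.lower().split()}

# Toy training data (illustrative only).
train = [
    (features("the cat sat on the mat"), "en"),
    (features("a dog is in the house"), "en"),
    (features("die katze sitzt auf der matte"), "de"),
    (features("ein hund ist im haus"), "de"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("the dog is on the mat")))  # -> en
print(classifier.prob_classify(features("der hund ist im haus")).prob("de"))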

archgoon
  • Thanks for the tip, that sounds very promising. :-) Also very cool that it comes with many text corpora, so I won't have to train it all myself – niklassaers Jul 04 '10 at 06:46
5

You can surely build your own, given some statistics about letter frequencies, digraph frequencies, and so on for your target languages.

Then release it as open source. And voilà, you have an open source engine for detecting the language of text!
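
As a minimal sketch of that idea, here is a character-bigram detector; the tiny training samples and the smoothing constant are illustrative assumptions, and real profiles would be built from large corpora of each target language.

from collections import Counter
import math

# Illustrative training text per language; real profiles need far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und rennt weg",
}

def bigram_profile(text):
    # Relative frequencies of character bigrams.
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()}

PROFILES = {lang: bigram_profile(t) for lang, t in SAMPLES.items()}

def detect(text, smoothing=1e-6):
    # Score each language by the log-likelihood of the text's bigrams.
    scores = {
        lang: sum(math.log(profile.get(text[i:i + 2], smoothing))
                  for i in range(len(text) - 1))
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get), scores

print(detect("der hund rennt")[0])  # -> de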

Dolph
4

For future reference, the engine I ended up using is libtextcat, which is BSD-licensed but seems not to have been maintained since 2003. Still, it does a good job and integrates easily into my toolchain.

niklassaers
  • "It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy" – Nicolas Raoul Sep 27 '10 at 09:41
  • That's frightfully optimistic. ;-) With ~1,700,000 pages run, it detects the language correctly for ~50%, offers multiple suggestions among which the correct language appears for ~20% more, and misses the rest. For the misses, I am lucky enough to have other data to back me up :-) – niklassaers Sep 27 '10 at 19:49
3

Try CLD2:

Installation

export CPPFLAGS="-std=c++98"  # https://github.com/CLD2Owners/cld2/issues/47
pip install cld2-cffi --user

Run

import cld2

res = cld2.detect("This is a sample text.")
print(res)
res = cld2.detect("Dies ist ein Beispieltext.")
print(res)
res = cld2.detect("Je ne peut pas parler cette language.")
print(res)
res = cld2.detect(" هذه هي بعض النصوص العربية")
print(res)
res = cld2.detect("这是一些阿拉伯文字")  # Chinese?
print(res)
res = cld2.detect("これは、いくつかのアラビア語のテキストです")
print(res)
print("Supports {} languages.".format(len(cld2.LANGUAGES)))

Gives

Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Supports 282 languages.
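
Since the question asks for a probability metric: as the output above shows, each Detection in details carries percent and score fields, so the top guess and its confidence can be read off directly. A small sketch, reusing the tuple shape printed above:

# Unpack the Detections tuple (field order as in the output above).
is_reliable, bytes_found, details = cld2.detect("Dies ist ein Beispieltext.")
top = details[0]
print(top.language_code, top.percent)  # -> de 96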


Martin Thoma
2

I don't think you need anything very sophisticated. For example, to detect whether a document is in English with a pretty high level of certainty, simply test whether it contains the N most common English words; something like:

"the a an is to are in on in it"

If it contains all of those, I would say it is almost definitely English.
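
A minimal sketch of that heuristic; the stopword lists are illustrative, not authoritative, and scoring the fraction of common words present gives the percentage score mentioned in the comments below.

# Illustrative common-word lists per language (a real list would be longer).
COMMON_WORDS = {
    "en": {"the", "a", "an", "is", "to", "are", "in", "on", "it"},
    "sv": {"och", "att", "det", "som", "en", "är", "av", "för", "med"},
}

def guess_language(text):
    words = set(text.lower().split())
    # Fraction of each language's common words found in the text.
    scores = {lang: len(words & common) / len(common)
              for lang, common in COMMON_WORDS.items()}
    return max(scores, key=scores.get), scores

print(guess_language("The cat is on the mat and it is happy."))  # -> en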

  • Unless you check most of those, there will be a risk of false positives... E.g. "Kan jag komma in och få lite is till min läsk?" ("Can I come in and get some ice for my soda?") would be flagged as English. – Gert Grenander Jul 03 '10 at 22:47
  • @Gert That's why I said "all" - of course you could also produce a percentage score. And there will always be false positives, whatever you do. –  Jul 03 '10 at 23:19
  • @Neil Butterworth - No problem. I understand what you mean. It's just that you have to be careful, since languages share some common elements. :) – Gert Grenander Jul 03 '10 at 23:30
  • I'm not looking for English in particular; my initial task is to identify which European language each page is in, including Greek and Swedish – niklassaers Jul 04 '10 at 06:44
1

You could alternatively try Ruby's WhatLanguage gem; it's nice and simple and I've used it for Twitter data analysis. See http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp for a quick demo.

alexizydorczyk
1

Check out Franc on GitHub. It's written in JavaScript, so you could use it in a browser and maybe in Node too.

  • franc supports more languages than any other library, or Google;
  • franc is easily forked to support 335 languages;
  • franc is just as fast as the competition.
skibulk