25

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 

For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English or not. Is this functionality not offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be the best fit for this.

Bhargav Rao
ocean800

7 Answers

27

There is a library called langdetect. It is a port of Google's language-detection library, available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
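For reference, a minimal usage sketch (assuming pip install langdetect); detect returns an ISO 639-1 code such as 'en', and seeding the detector makes its results deterministic:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

# Keep only the documents detected as English.
english_docs = [d for d in docs if detect(d) == 'en']
print(english_docs)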

salehinejad
  • Exactly what I was looking for, thanks! :) Just a question, do you know anything about the performance of this library on long documents? – ocean800 Apr 12 '17 at 18:54
  • I have not used it. It would be great if you could share your experience here. – salehinejad Apr 12 '17 at 19:03
  • Unfortunately, it was very slow on long sets of documents, but thanks! – ocean800 Apr 16 '17 at 17:11
  • langdetect doesn't always detect correctly; it fails sometimes. I tried to detect the word 'DRIVE' and it said it was German. – Pravin Nov 16 '22 at 08:57
  • @ocean800 Why do you care about long documents? If a document is in English, all its sentences are in English, so it is sufficient to analyze just one sentence. – ceving Dec 10 '22 at 11:05
23

You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

TL;DR:

  • CLD-2 is pretty good and extremely fast
  • lang-detect is a tiny bit better, but much slower
  • langid is good, but CLD-2 and lang-detect are much better
  • NLTK's Textcat is neither efficient nor effective.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra
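If you prefer to call CLD-2 from Python rather than through the lidtk command line, the pycld2 bindings expose a single detect call. A minimal sketch, assuming pip install pycld2:

import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect("Ce n'est pas en anglais")
# details is a tuple of (languageName, languageCode, percent, score) entries,
# with the best guess first.
print(is_reliable, details[0][1])  # e.g. True fr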
Martin Thoma
5

Pretrained fastText Model Worked Best For My Similar Needs

I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.

After experimenting to find what worked best for my needs, which were making sure 60,000+ text files were in English, I found that fastText was an excellent tool.

With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        #    that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below, just clean up the lines in fla
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Everytime that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # calculate a relative confidence of the max confidence to all
        #    confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is english or not.
        return max_key == 'en'

Below is the application / instantiation and use of the above class for my needs.

file_list = # some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
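For sentence-level checks like the list in the question (rather than whole files), the same pretrained model can be called directly. A small sketch, assuming lid.176.ftz sits in the working directory:

import fasttext

model = fasttext.load_model('lid.176.ftz')  # pretrained language-ID model

labels, confidences = model.predict("Ce n'est pas en anglais")
print(labels[0], confidences[0])  # top label (e.g. __label__fr) and its confidence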
Thom Ives
4

This is what I used some time ago. With the settings below, it accepts texts of at least 3 words with no more than 4 unrecognized words. Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
grizmin
  • Does the code need both single quote and double quotes for the `is_in_english` call? – David Medinets Aug 06 '20 at 22:08
  • If you look at it closely, you will see that those are in fact not regular double quotes, just a symbol that looks like double quotes. – grizmin Aug 08 '20 at 20:39
2

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc

dictionary.check("Hello") # prints True
dictionary.check("Helo") #prints False

This example is taken directly from their website

lordingtar
  • Thanks, this library looks interesting as well. Do you know anything about the performance of this library on long document strings? – ocean800 Apr 12 '17 at 18:56
  • I haven't used it on very long document strings; I trained my own model for that. Give it a shot and see if the library is powerful enough for you! It also has its own spellchecker (primary purpose of the library) – lordingtar Apr 12 '17 at 19:04
  • Will try it out and see which library works better, thanks :) – ocean800 Apr 12 '17 at 19:08
  • enchant seems only able to characterize English words, not phrases: e.g. "Hello" is checked as `True` but "hello world" is checked as `False`. It is also no longer actively maintained. – yuqli Nov 09 '18 at 17:28
2

If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
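To make this concrete, here is a minimal, self-contained sketch (not the ActiveState recipe itself) that builds character-trigram frequency profiles and compares them with cosine similarity; the english_ref text below is a toy stand-in for a profile built from a real English corpus:

from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    """Cosine similarity between two trigram Counters."""
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Toy reference profile; in practice, build it from a large English corpus.
english_ref = trigram_profile("this is some text written in English "
                              "and some more text written in English")

for sentence in ["this is some more text written in English",
                 "Ce n'est pas en anglais"]:
    score = cosine_similarity(english_ref, trigram_profile(sentence))
    print(sentence, round(score, 3))  # choose a threshold based on your own data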

alexis
  • Thanks for this answer! Just a question, do you know anything about the performance of this on large datasets? – ocean800 Apr 16 '17 at 17:03
  • Trigram models are fast... there's not much to do. But what do you mean by "large dataset"? If each of your documents is in a single language, and you have so many documents that counting trigrams over the entire document is slowing you down, just stop after a few hundred words. – alexis Apr 17 '17 at 19:17
0
import enchant

def check(text):
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    # True only if every whitespace-separated word is recognized as English.
    return all(dictionary.check(word) for word in text.split())