25

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:

[ "this is some text written in English", 
  "this is some more text written in English", 
  "Ce n'est pas en anglais" ] 

For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether a string is in English or not. Is this functionality not offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be the best fit for this.

Bhargav Rao
ocean800

7 Answers

27

There is a library called langdetect. It is a port of Google's language-detection library, available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
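For reference, a minimal usage sketch (assuming pip install langdetect); detect returns an ISO 639-1 code such as 'en', and seeding the detector makes its results deterministic:

from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

# Keep only the documents detected as English.
english_docs = [d for d in docs if detect(d) == 'en']
print(english_docs)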

salehinejad
  • Exactly what I was looking for, thanks! :) Just a question, do you know anything about the performance of this library on long documents? – ocean800 Apr 12 '17 at 18:54
  • I have not used it. It would be great if you could share your experience here. – salehinejad Apr 12 '17 at 19:03
  • Unfortunately, it was very slow on long sets of documents, but thanks! – ocean800 Apr 16 '17 at 17:11
  • langdetect doesn't always detect correctly; it fails sometimes. I tried to detect the word 'DRIVE' and it said it was German. – Pravin Nov 16 '22 at 08:57
  • @ocean800 Why do you care about long documents? If a document is in English, all its sentences are in English, so it is sufficient to analyze just one sentence. – ceving Dec 10 '22 at 11:05
23

You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

TL;DR:

  • CLD-2 is pretty good and extremely fast
  • lang-detect is a tiny bit better, but much slower
  • langid is good, but CLD-2 and lang-detect are much better
  • NLTK's Textcat is neither efficient nor effective.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra
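If you prefer to call CLD-2 from Python rather than through the lidtk command line, the pycld2 bindings expose a single detect call. A minimal sketch, assuming pip install pycld2:

import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect("Ce n'est pas en anglais")
# details is a tuple of (languageName, languageCode, percent, score) entries,
# with the best guess first.
print(is_reliable, details[0][1])  # e.g. True fr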
Martin Thoma
5

Pretrained fastText Model Worked Best For My Similar Needs

I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.

After experimenting to find what worked best for my needs, which were making sure 60,000+ text files were in English, I found that fastText was an excellent tool.

With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        #    that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below, just clean up the lines in fla
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Everytime that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # calculate a relative confidence of the max confidence to all
        #    confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is english or not.
        return max_key == 'en'

Below is the application / instantiation and use of the above class for my needs.

file_list = # some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
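For sentence-level checks like the list in the question (rather than whole files), the same pretrained model can be called directly. A small sketch, assuming lid.176.ftz sits in the working directory:

import fasttext

model = fasttext.load_model('lid.176.ftz')  # pretrained language-ID model

labels, confidences = model.predict("Ce n'est pas en anglais")
print(labels[0], confidences[0])  # top label (e.g. __label__fr) and its confidence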
Thom Ives
4

This is what I used some time ago. With the settings below, it accepts texts of at least 3 words with no more than 4 unrecognized words. Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
grizmin
  • Does the code need both single quote and double quotes for the `is_in_english` call? – David Medinets Aug 06 '20 at 22:08
  • If you look at it closely, you will see that those are in fact not regular double quotes, just a symbol that looks like double quotes. – grizmin Aug 08 '20 at 20:39
2

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc

dictionary.check("Hello") # prints True
dictionary.check("Helo") #prints False

This example is taken directly from their website

lordingtar
  • Thanks, this library looks interesting as well. Do you know anything about the performance of this library on long document strings? – ocean800 Apr 12 '17 at 18:56
  • I haven't used it on very long document strings; I trained my own model for that. Give it a shot and see if the library is powerful enough for you! It also has its own spellchecker (primary purpose of the library) – lordingtar Apr 12 '17 at 19:04
  • Will try it out and see which library works better, thanks :) – ocean800 Apr 12 '17 at 19:08
  • enchant seems only able to characterize English words, not phrases: e.g. "Hello" is checked as `True` but "hello world" is checked as `False`. It is also no longer actively maintained. – yuqli Nov 09 '18 at 17:28
2

If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:

http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/

If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
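To make this concrete, here is a minimal, self-contained sketch (not the ActiveState recipe itself) that builds character-trigram frequency profiles and compares them with cosine similarity; the english_ref text below is a toy stand-in for a profile built from a real English corpus:

from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count overlapping character trigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(p, q):
    """Cosine similarity between two trigram Counters."""
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Toy reference profile; in practice, build it from a large English corpus.
english_ref = trigram_profile("this is some text written in English "
                              "and some more text written in English")

for sentence in ["this is some more text written in English",
                 "Ce n'est pas en anglais"]:
    score = cosine_similarity(english_ref, trigram_profile(sentence))
    print(sentence, round(score, 3))  # choose a threshold based on your own data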

alexis
  • Thanks for this answer! Just a question, do you know anything about the performance of this on large datasets? – ocean800 Apr 16 '17 at 17:03
  • Trigram models are fast... there's not much to do. But what do you mean by "large dataset"? If each of your documents is in a single language, and you have so many documents that counting trigrams over the entire document is slowing you down, just stop after a few hundred words. – alexis Apr 17 '17 at 19:17
0
import enchant

def check(text):
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    # True only if every whitespace-separated word is recognized as English.
    return all(dictionary.check(word) for word in text.split())