I'm looking for a simple way to detect whether a short excerpt of text, a few sentences, is English or not. It seems to me that this problem is much easier than trying to detect an arbitrary language. Is there any software out there that can do this? I'm writing in Python, and would prefer a Python library, but something else would be fine too. I've tried Google, but then realized its TOS didn't allow automated queries.
-
possible duplicate of [Python - can I detect unicode string language code?](http://stackoverflow.com/questions/4545977/python-can-i-detect-unicode-string-language-code) – ismail Jan 05 '11 at 14:26
-
I'm asking for English only here, as opposed to that thread where they ask for any arbitrary language. – user449511 Jan 05 '11 at 14:34
-
It just works fine for English. – ismail Jan 05 '11 at 14:42
-
@user Look at some of the answers there, they might still be applicable. Google Translate also detects language and it worked for you when it did. – moinudin Jan 05 '11 at 14:45
5 Answers
I read about a method to detect English language by using trigrams.
You can go over the text and find the most frequently used trigrams in its words. If the most frequent ones match the most frequent trigrams among English words, the text may be written in English.
Take a look at this Ruby project:
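If you want to stay in Python, here is a minimal sketch of the trigram idea. The set of reference trigrams below is a tiny hand-picked sample for illustration, not a real frequency profile, and the threshold is a guess; a real detector would build the profile from a large English corpus and tune the cutoff.

from collections import Counter

# A tiny illustrative sample of frequent English trigrams; a real profile
# would be computed from a large English corpus.
ENGLISH_TRIGRAMS = {'the', 'and', 'ing', 'her', 'his', 'tha', 'ere',
                    'for', 'ent', 'ion', 'ter', 'was', 'you', 'ith'}

def looks_english(text, threshold=0.1):
    # Keep only letters and spaces, then count trigrams within words.
    cleaned = ''.join(c for c in text.lower() if c.isalpha() or c.isspace())
    trigrams = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2)
                       if cleaned[i:i + 3].isalpha())
    total = sum(trigrams.values())
    if total == 0:
        return False
    # Fraction of trigram occurrences that are common English trigrams.
    hits = sum(count for tri, count in trigrams.items()
               if tri in ENGLISH_TRIGRAMS)
    return hits / float(total) >= threshold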

-
Thanks! This is an easy idea to implement, and I can give it a quick test with a small set of test text that I have to see how well it works! – user449511 Jan 05 '11 at 14:37
-
This is going to require a large batch of sample text. OP might not have access to that. – moinudin Jan 05 '11 at 14:38
EDIT: This won't work in this case, since the OP is processing text in bulk, which is against Google's TOS.
Use the Google Translate language detect API. Python example from the docs:
import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/language/detect?' +
       'v=1.0&q=Hola,%20mi%20amigo!&key=INSERT-YOUR-KEY&userip=INSERT-USER-IP')
# The Referer header should be the URL of your site.
request = urllib2.Request(url, None, {'Referer': 'INSERT-YOUR-SITE-URL'})
response = urllib2.urlopen(request)
results = simplejson.load(response)
if results['responseData']['language'] == 'en':
    print 'English detected'

-
"The Google Language Detect API must be used for user-generated language detection. Automated or batched queries of any kind are strictly prohibited". I guess that is why the question asker is referring to the Terms of Service he saw as well, and I assume he therefore wants to detect a language without any user input. – Tom van Enckevort Jan 05 '11 at 14:33
-
@tomlog You're probably right. I thought he was referring to scraping GT pages. @user, can you confirm whether or not you're processing user-generated strings? – moinudin Jan 05 '11 at 14:36
-
I was batch querying their API with my text and got denied access, and realized my problem. I'm not using user-generated strings. Thanks! – user449511 Jan 05 '11 at 14:38
-
@user Okay, then this won't work. I'll keep my answer for reference (in case others come along), but will add a note. – moinudin Jan 05 '11 at 14:40
-
Thanks anyway! I will say, it did work very well while it lasted. I had no false positives in my filtered list when I went by their "reliable" tag. – user449511 Jan 05 '11 at 14:42
-
@user In future, if you don't have too much to process you can avoid being banned by processing the requests at a human speed. – moinudin Jan 05 '11 at 14:44
Although not as good as Google's own, I have had good results using Apache Nutch's LanguageIdentifier, which comes with its own pretrained ngram models. I had quite good results on a large (50GB, mostly-text PDF) corpus of real-world data in several languages.
It is in Java, but I'm sure you can read the ngram profiles from it if you want to reimplement it in Python.
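If you do go the reimplementation route, the underlying technique is the Cavnar-Trenkle ngram ranking approach. Here is a rough Python sketch of the "out-of-place" profile comparison; the profile file format assumed here (one ngram per line, most frequent first) is an assumption, so check the actual format of Nutch's profile files before relying on it:

from collections import Counter

def load_profile(path):
    # Assumed format: one ngram per line, ordered from most to least frequent.
    with open(path) as f:
        return {ngram: rank for rank, ngram
                in enumerate(line.strip() for line in f if line.strip())}

def text_ngrams(text, max_n=3, top=300):
    # Rank the most common 1..max_n character ngrams of the input text.
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [ngram for ngram, _ in counts.most_common(top)]

def out_of_place(text, profile, max_penalty=300):
    # Cavnar-Trenkle distance: sum of rank differences between the text's
    # ngram ranking and the language profile; lower means a better match.
    return sum(abs(rank - profile.get(ngram, max_penalty))
               for rank, ngram in enumerate(text_ngrams(text)))

Text whose out_of_place distance from the English profile falls below some tuned cutoff would be classified as English.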
Google Translate API v2 allows automated queries, but it requires the use of an API key that you can get for free from the Google APIs console.
To detect whether text is English you could use the detect_language_v2() function (which uses that API) from my answer to the question Python - can I detect unicode string language code?:
if all(lang == 'en' for lang in detect_language_v2(['some text', 'more text'])):
# all text fragments are in English
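For reference, a detect_language_v2() along these lines might look roughly like the sketch below. It assumes the v2 REST detect endpoint and its documented JSON response shape; treat it as an illustration rather than the exact code from that answer.

import json
import urllib
import urllib2

def detect_language_v2(texts, api_key='INSERT-YOUR-KEY'):
    # One 'q' parameter per text fragment; the API detects each separately.
    query = urllib.urlencode([('key', api_key)] + [('q', t) for t in texts])
    response = urllib2.urlopen(
        'https://www.googleapis.com/language/translate/v2/detect?' + query)
    data = json.load(response)
    # Each entry in data['data']['detections'] is a list of candidate
    # detections for one input; take the top candidate's language code.
    return [candidates[0]['language']
            for candidates in data['data']['detections']]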
I recently wrote a solution for this. My solution is not foolproof, and I do not think it would be computationally viable for large amounts of text, but it seems to me to work well for smallish sentences.
Suppose you have two strings of text:
- "LETMEBEGINBYSAYINGTHANKS"
- "UNGHSYINDJFHAKJSNFNDKUAJUD"
The goal then is to determine that 1. is probably English while 2. is not. Intuitively, the way my mind determines this is by looking for the word boundaries of English words in the sentences (LET, ME, BEGIN, etc.). But this is not straightforward computationally because there are overlapping words (BE, GIN, BEGIN, SAY, SAYING, THANK, THANKS, etc.).
My method does the following:
- Take the intersection of { known English words } and { all substrings of the text of all lengths }.
- Construct a graph whose vertices are the starting indices of those words in the sentence, with a directed edge from each word's start to the position of the letter just after the word's end. E.g., (0) would be L, so "LET" can be represented by (0) -> (3), where (3) is M, so that's "LET ME".
- Find the largest integer n between 0 and len(text) for which a simple directed path exists from index 0 to index n.
- Divide that number n by the length of the text to get a rough idea of what percent of the text appears to be consecutive English words.
Note that my code assumes no spaces between words. If you already have spaces, then my method is silly, since the core of my solution is about figuring out where the spaces should be. (If you are reading this and you have spaces, then you are probably trying to solve a more sophisticated problem.) Also, for my code to work you need an English wordlist file. I got one from here, but you can use any such file, and I imagine this technique could be extended to other languages too.
Here is the code:
from collections import defaultdict

# This function estimates what percent of the string looks like it could
# be English-language text.
# We use an English words list from here:
# https://github.com/first20hours/google-10000-english
def englishness(maybeplaintext):
    maybeplaintext = maybeplaintext.lower()
    with open('words.txt', 'r') as f:
        words = set(f.read().lower().split("\n"))
    # Now let's iterate over starting positions and look for some English!
    # An edge start -> end means maybeplaintext[start:end] is a known word.
    wordGraph = defaultdict(list)
    lt = len(maybeplaintext)
    for start in range(0, lt):
        st = lt - start
        if st > 1:
            for length in range(2, st):
                end = start + length
                possibleWord = maybeplaintext[start:end]
                if possibleWord in words:
                    wordGraph[start].append(end)
    # Ok, now we have a big graph of words. How far from the first letter
    # can we travel, moving exclusively through the English language?
    englishness = 0
    for i in range(2, lt):
        if isReachable(wordGraph, i):
            englishness = i
    return englishness / lt

# Here I use my modified version of the breadth-first search technique from:
# https://www.geeksforgeeks.org/
# find-if-there-is-a-path-between-two-vertices-in-a-given-graph/
def isReachable(wordGraph, end):
    visited = [0]
    queue = [0]
    while queue:
        n = queue.pop(0)
        if n >= end:
            return True
        for i in wordGraph[n]:
            if i not in visited:
                queue.append(i)
                visited.append(i)
    return False
And here is the I/O for the initial examples I gave:
In [5]: englishness('LETMEBEGINBYSAYINGTHANKS')
Out[5]: 0.9583333333333334
In [6]: englishness('UNGHSYINDJFHAKJSNFNDKUAJUD')
Out[6]: 0.07692307692307693
So then, approximately speaking, I am 96% certain that LETMEBEGINBYSAYINGTHANKS is English, and 8% certain that UNGHSYINDJFHAKJSNFNDKUAJUD is English. Which sounds about right!
To extend this to much larger pieces of text, my suggestion would be to subsample random short substrings and check their "englishness". Hope this helps!
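For example, a hypothetical wrapper along those lines (the function name and sample sizes are my own invention, not part of the answer above) could look like:

import random

def englishness_sampled(text, samples=10, size=40):
    # Score a handful of random windows instead of the whole text,
    # then average the results.
    scores = []
    for _ in range(samples):
        start = random.randrange(max(1, len(text) - size))
        scores.append(englishness(text[start:start + size]))
    return sum(scores) / len(scores)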

-
A professor of mine observed that my technique could be improved by going backward rather than forward through the graph, assuming that more often than not we are not looking at English. Additionally, I think a slight improvement could be made with a bisect search method, to get rid of unnecessary checks - whether or not this would improve things likely depends on the frequency distribution of the lengths of English segments in the input. – Max von Hippel Feb 19 '18 at 20:29