
I am writing a bot that checks thousands of websites to determine whether each one is in English or not.

I am using Scrapy (a Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to detect a website's language?

Any help would be appreciated.

DSM
akhter wahab

8 Answers


Since you are using Python, you can try out NLTK. More precisely, you can check out NLTK.detect.

More information and the exact code snippet is here: NLTK and language detection

Yavar

You can use the response headers (specifically `Content-Language`) to find out:

Wikipedia
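In a Scrapy spider the header is available on the response object; the sketch below shows the same idea with a plain dict standing in for the headers (the helper name is made up for illustration):

```python
def language_from_headers(headers):
    """Pull the primary language code out of a Content-Language header, if any."""
    value = headers.get("Content-Language", "")
    # The header may list several tags, e.g. "en-US, fr"; take the first one
    first = value.split(",")[0].strip()
    # Reduce a tag like "en-US" to the bare language code "en"
    return first.split("-")[0].lower() or None

# Example response headers as a plain dict
headers = {"Content-Type": "text/html; charset=utf-8",
           "Content-Language": "en-US"}
print(language_from_headers(headers))  # -> en
```

If the header is missing, the function returns `None`, which is exactly the case where you would fall back to other measures.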

Hedde van der Heide
  • Does every website have a `Content-Language` attribute? I don't have much exposure to websites. – akhter wahab Jul 16 '12 at 15:22
  • Likely; it's part of the HTTP protocol, and it's the easiest way to meet your requirement without extra dependencies. If it doesn't suit your needs, you can always extend to other measures. You might want a fallback pipeline, for instance. – Hedde van der Heide Jul 16 '12 at 15:32
  • Can you please explain what you mean by "You might want a fallback pipeline for instance"? – akhter wahab Jul 16 '12 at 15:40
  • You could create a cycle of options for determining the language, starting with the least resource-costly and moving on to something more robust each time the previous method fails. – Hedde van der Heide Jul 16 '12 at 15:43
  • -1 The HTTP header is not very reliable. Many page authors don't mark up the language they write in, many web page authoring tools won't let them, many admins don't let users set this for individual pages, etc; and when people do try to specify this information, they sometimes get it wrong (for example, many Swedish pages have the country code for Sweden `se` instead of the language code for Swedish `sv`). – tripleee Jul 16 '12 at 18:41
  • Yes, but it's the cheapest variable to access, and as I said it could serve as a starting point for a fallback cycle ordered by resource expense. – Hedde van der Heide Jul 16 '12 at 20:22

If the sites are multilingual, you can send the `Accept-Language: en-US,en;q=0.8` header and expect the response to be in English. If they are not, you can inspect the `response.headers` dictionary to see if you can find any information about the language.

If still unlucky, you can try mapping the IP to a country and then the country to a language in some way. As a last resort, try detecting the language from the text itself (I don't know how accurate this is).
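The cascade this answer describes, cheap checks first and heavier ones only on failure, could be sketched like this (the dict-based response and all helper names are illustrative, not Scrapy's API):

```python
import re

def detect_from_headers(response):
    # Cheapest check: trust the Content-Language header when present
    value = response.get("headers", {}).get("Content-Language", "")
    return value.split(",")[0].split("-")[0].strip().lower() or None

def detect_from_html(response):
    # Next: look for a lang attribute on the <html> tag
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z]{2})',
                      response.get("body", ""))
    return match.group(1).lower() if match else None

def detect_language(response, detectors):
    """Run detectors in order of cost; return the first non-empty answer."""
    for detect in detectors:
        lang = detect(response)
        if lang:
            return lang
    return None

# A fake response with no useful headers; the pipeline falls through to the HTML check
response = {"headers": {}, "body": '<html lang="en"><body>Hello</body></html>'}
print(detect_language(response, [detect_from_headers, detect_from_html]))  # -> en
```

Appending an IP-to-country lookup or a statistical detector to the list extends the cycle without touching the driver function.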

martincho

If you are using Python, I highly recommend the standalone LangID module written by Marco Lui and Tim Baldwin. The model is pre-trained and the detection is highly accurate. It can also handle XML/HTML documents.

Paolo Moretti
nqngo

Look into the Natural Language Toolkit:

NLTK: http://nltk.org/

What you want to look into is using the corpus module to load the default English vocabulary shipped with NLTK:

nltk.corpus.words.words()

Then, compare your text with the above using difflib.

Reference: http://docs.python.org/library/difflib.html

Using these tools, you can build a score that measures how closely your text matches the English vocabulary provided by NLTK.
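A minimal sketch of that comparison, with a tiny hard-coded word list standing in for `nltk.corpus.words.words()` (which is what you would load in practice; the function name is made up):

```python
import difflib

# A tiny stand-in for nltk.corpus.words.words(); load the real NLTK
# vocabulary here in practice
english_vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy",
                 "dog", "hello", "world", "language", "detect"]

def english_score(text, vocab=english_vocab):
    """Fraction of tokens that closely match a known English word."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    if not tokens:
        return 0.0
    # difflib.get_close_matches does fuzzy matching, so minor typos still count
    hits = sum(1 for t in tokens
               if difflib.get_close_matches(t, vocab, n=1, cutoff=0.8))
    return hits / float(len(tokens))

print(english_score("the quick brown fox"))     # -> 1.0
print(english_score("el zorro marron rapido"))  # much lower
```

You would then pick a threshold (the "scale" mentioned above) and treat pages scoring beyond it as English.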

Daniel Li
  • In a resource efficient crawler this is something I would add somewhere at the bottom of my pipeline tbh – Hedde van der Heide Jul 16 '12 at 15:33
  • Update: NLTK now offers a [module for language identification](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.textcat) – avip Mar 15 '16 at 06:03

You can use the Language Detection API at http://detectlanguage.com. It accepts a text string via GET or POST and returns JSON output with scores. There are free and premium plans.
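A hedged sketch of calling such a service with only the standard library; the endpoint path and the `q`/`key` parameter names are assumptions, so check the service's own documentation before relying on them:

```python
try:                                        # Python 2, as in the question
    from urllib import urlencode
    from urllib2 import Request
except ImportError:                         # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request

# Assumed endpoint; verify against detectlanguage.com's documentation
API_URL = "https://ws.detectlanguage.com/0.2/detect"

def build_detect_request(text, api_key):
    # POST body with the text to classify and the account's API key
    data = urlencode({"q": text, "key": api_key}).encode("utf-8")
    return Request(API_URL, data=data)

# Build (but do not send) a request; pass it to urlopen() to actually call the API
req = build_detect_request("Hello world", "YOUR_API_KEY")
print(req.get_full_url())
```

Sending the request and parsing the JSON response is then a matter of `urlopen(req)` plus `json.loads`.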

Laurynas

If an HTML website uses non-English characters, this is usually declared in a meta tag in the page source; it helps browsers render the page correctly.

Here is an example from an Arabic website, http://www.tanmia.ae, that has both an English page and an Arabic page.

A meta tag on the Arabic page is: `<meta http-equiv="X-UA-Compatible" content="IE=edge">`

The same page in English has: `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />`

Maybe have the bot look at the meta tags: if the page is English, proceed; otherwise ignore it?
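Note that the `lang` attribute on the `<html>` tag, or a `Content-Language` meta tag, is a more direct language signal than the charset. A stdlib sketch of such a check (the class and helper names are made up for illustration):

```python
try:                                     # Python 3
    from html.parser import HTMLParser
except ImportError:                      # Python 2, as in the question
    from HTMLParser import HTMLParser

class LangSniffer(HTMLParser):
    """Collect language hints from the <html> tag and meta tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            # <html lang="en-GB"> -> "en"
            self.lang = attrs["lang"].split("-")[0].lower()
        elif (tag == "meta" and self.lang is None
              and (attrs.get("http-equiv") or "").lower() == "content-language"
              and attrs.get("content")):
            # <meta http-equiv="Content-Language" content="ar"> -> "ar"
            self.lang = attrs["content"].split(",")[0].split("-")[0].strip().lower()

def sniff_language(html):
    sniffer = LangSniffer()
    sniffer.feed(html)
    return sniffer.lang

print(sniff_language('<html lang="en-GB"><head></head></html>'))  # -> en
```

The bot could then keep pages where this returns `"en"` and skip the rest, falling back to text-based detection when it returns `None`.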

SSSSSam

If you don't want to trust what the webpage tells you but want to check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but contain a bit of English (enough to fool heuristics like "check if the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.

It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams function.
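A toy illustration of the trigram approach, with profiles built from tiny sample sentences (real tables need far more training text per language, e.g. the NLTK data mentioned above):

```python
from collections import Counter

def trigrams(text):
    """Character-trigram frequency profile of a text."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram profiles."""
    dot = sum(profile_a[t] * profile_b[t] for t in set(profile_a) & set(profile_b))
    norm_a = sum(v * v for v in profile_a.values()) ** 0.5
    norm_b = sum(v * v for v in profile_b.values()) ** 0.5
    return dot / (norm_a * norm_b)

# Toy training profiles; real tables need far more text per language
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess_language(text):
    """Pick the language whose trigram profile is closest to the sample's."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))

print(guess_language("the dog and the fox"))  # -> en
```

Because the comparison works on character statistics rather than whole words, it degrades gracefully on mixed-language pages.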

alexis