
I am writing a bot that checks thousands of websites to determine whether each one is in English or not.

I am using Scrapy (a Python 2.7 framework) to crawl the first page of each website.

Can someone suggest the best way to detect a website's language?

Any help would be appreciated.

DSM
akhter wahab

8 Answers


Since you are using Python, you can try out NLTK. More precisely, you can check out NLTK.detect.

More information and the exact code snippet is here: NLTK and language detection

Yavar

You can use the response headers (specifically `Content-Language`) to find out:

Wikipedia
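In a Scrapy spider the header is available on the response object; the sketch below shows the same idea with a plain dict standing in for the headers (the helper name is made up for illustration):

```python
def language_from_headers(headers):
    """Pull the primary language code out of a Content-Language header, if any."""
    value = headers.get("Content-Language", "")
    # The header may list several tags, e.g. "en-US, fr"; take the first one
    first = value.split(",")[0].strip()
    # Reduce a tag like "en-US" to the bare language code "en"
    return first.split("-")[0].lower() or None

# Example response headers as a plain dict
headers = {"Content-Type": "text/html; charset=utf-8",
           "Content-Language": "en-US"}
print(language_from_headers(headers))  # -> en
```

If the header is missing, the function returns `None`, which is exactly the case where you would fall back to other measures.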

Hedde van der Heide
  • Does every website have a `Content-Language` attribute? I don't have much exposure to websites. – akhter wahab Jul 16 '12 at 15:22
  • Likely; it's part of the HTTP protocol, and it's the easiest way to meet your requirement without extra dependencies. If it doesn't suit your needs, you can always extend to other measures. You might want a fallback pipeline, for instance. – Hedde van der Heide Jul 16 '12 at 15:32
  • Can you please explain what you mean by "You might want a fallback pipeline for instance"? – akhter wahab Jul 16 '12 at 15:40
  • You could create a cycle of options for determining the language, starting with the least resource-costly and moving on to something more robust each time the previous method fails. – Hedde van der Heide Jul 16 '12 at 15:43
  • -1 The HTTP header is not very reliable. Many page authors don't mark up the language they write in, many web page authoring tools won't let them, many admins don't let users set this for individual pages, etc; and when people do try to specify this information, they sometimes get it wrong (for example, many Swedish pages have the country code for Sweden `se` instead of the language code for Swedish `sv`). – tripleee Jul 16 '12 at 18:41
  • Yes, but it's the cheapest variable to access, and as I said it could serve as a starting point for a fallback cycle ordered by resource expense. – Hedde van der Heide Jul 16 '12 at 20:22

If the sites are multilingual, you can send the `Accept-Language: en-US,en;q=0.8` header and expect the response to be in English. If they are not, you can inspect the `response.headers` dictionary to see if you can find any information about the language.

If still unlucky, you can try mapping the IP to a country and then the country to a language in some way. As a last resort, try detecting the language from the text itself (I don't know how accurate this is).
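The cascade this answer describes, cheap checks first and heavier ones only on failure, could be sketched like this (the dict-based response and all helper names are illustrative, not Scrapy's API):

```python
import re

def detect_from_headers(response):
    # Cheapest check: trust the Content-Language header when present
    value = response.get("headers", {}).get("Content-Language", "")
    return value.split(",")[0].split("-")[0].strip().lower() or None

def detect_from_html(response):
    # Next: look for a lang attribute on the <html> tag
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z]{2})',
                      response.get("body", ""))
    return match.group(1).lower() if match else None

def detect_language(response, detectors):
    """Run detectors in order of cost; return the first non-empty answer."""
    for detect in detectors:
        lang = detect(response)
        if lang:
            return lang
    return None

# A fake response with no useful headers; the pipeline falls through to the HTML check
response = {"headers": {}, "body": '<html lang="en"><body>Hello</body></html>'}
print(detect_language(response, [detect_from_headers, detect_from_html]))  # -> en
```

Appending an IP-to-country lookup or a statistical detector to the list extends the cycle without touching the driver function.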

martincho

If you are using Python, I highly recommend the standalone LangID module written by Marco Lui and Tim Baldwin. The model is pre-trained and the detection is highly accurate. It can also handle XML/HTML documents.

Paolo Moretti
nqngo

Look into the Natural Language Toolkit:

NLTK: http://nltk.org/

What you want to look into is using the corpus module to load the default English vocabulary shipped with NLTK:

nltk.corpus.words.words()

Then, compare your text with the above using difflib.

Reference: http://docs.python.org/library/difflib.html

Using these tools, you can build a score that measures how closely your text matches the English vocabulary provided by NLTK.
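A minimal sketch of that comparison, with a tiny hard-coded word list standing in for `nltk.corpus.words.words()` (which is what you would load in practice; the function name is made up):

```python
import difflib

# A tiny stand-in for nltk.corpus.words.words(); load the real NLTK
# vocabulary here in practice
english_vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy",
                 "dog", "hello", "world", "language", "detect"]

def english_score(text, vocab=english_vocab):
    """Fraction of tokens that closely match a known English word."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    if not tokens:
        return 0.0
    # difflib.get_close_matches does fuzzy matching, so minor typos still count
    hits = sum(1 for t in tokens
               if difflib.get_close_matches(t, vocab, n=1, cutoff=0.8))
    return hits / float(len(tokens))

print(english_score("the quick brown fox"))     # -> 1.0
print(english_score("el zorro marron rapido"))  # much lower
```

You would then pick a threshold (the "scale" mentioned above) and treat pages scoring beyond it as English.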

Daniel Li
  • In a resource efficient crawler this is something I would add somewhere at the bottom of my pipeline tbh – Hedde van der Heide Jul 16 '12 at 15:33
  • Update: NLTK now offers a [module for language identification](http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.textcat) – avip Mar 15 '16 at 06:03

You can use the Language Detection API at http://detectlanguage.com. It accepts a text string via GET or POST and returns JSON output with scores. There are free and premium plans.
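A hedged sketch of calling such a service with only the standard library; the endpoint path and the `q`/`key` parameter names are assumptions, so check the service's own documentation before relying on them:

```python
try:                                        # Python 2, as in the question
    from urllib import urlencode
    from urllib2 import Request
except ImportError:                         # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request

# Assumed endpoint; verify against detectlanguage.com's documentation
API_URL = "https://ws.detectlanguage.com/0.2/detect"

def build_detect_request(text, api_key):
    # POST body with the text to classify and the account's API key
    data = urlencode({"q": text, "key": api_key}).encode("utf-8")
    return Request(API_URL, data=data)

# Build (but do not send) a request; pass it to urlopen() to actually call the API
req = build_detect_request("Hello world", "YOUR_API_KEY")
print(req.get_full_url())
```

Sending the request and parsing the JSON response is then a matter of `urlopen(req)` plus `json.loads`.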

Laurynas

If an HTML website uses non-English characters, this is usually declared in a meta tag in the page source; it helps browsers render the page correctly.

Here is an example from an Arabic website, http://www.tanmia.ae, that has both an English page and an Arabic page.

A meta tag on the Arabic page is: `<meta http-equiv="X-UA-Compatible" content="IE=edge">`

The same page in English has: `<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />`

Maybe have the bot look at the meta tags: if the page is English, proceed; otherwise ignore it?
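Note that the `lang` attribute on the `<html>` tag, or a `Content-Language` meta tag, is a more direct language signal than the charset. A stdlib sketch of such a check (the class and helper names are made up for illustration):

```python
try:                                     # Python 3
    from html.parser import HTMLParser
except ImportError:                      # Python 2, as in the question
    from HTMLParser import HTMLParser

class LangSniffer(HTMLParser):
    """Collect language hints from the <html> tag and meta tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            # <html lang="en-GB"> -> "en"
            self.lang = attrs["lang"].split("-")[0].lower()
        elif (tag == "meta" and self.lang is None
              and (attrs.get("http-equiv") or "").lower() == "content-language"
              and attrs.get("content")):
            # <meta http-equiv="Content-Language" content="ar"> -> "ar"
            self.lang = attrs["content"].split(",")[0].split("-")[0].strip().lower()

def sniff_language(html):
    sniffer = LangSniffer()
    sniffer.feed(html)
    return sniffer.lang

print(sniff_language('<html lang="en-GB"><head></head></html>'))  # -> en
```

The bot could then keep pages where this returns `"en"` and skip the rest, falling back to text-based detection when it returns `None`.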

SSSSSam

If you don't want to trust what the webpage tells you but want to check for yourself, you can use a statistical algorithm for language detection. Trigram-based algorithms are robust and should work well with pages that are mostly in another language but contain a bit of English (enough to fool heuristics like "check if the words the, and, or with are on the page"). Google "ngram language classification" and you'll find lots of references on how it's done.

It's easy enough to compile your own trigram tables for English, but the Natural Language Toolkit comes with a set for several common languages. They are in NLTK_DATA/corpora/langid. You could use the trigram data without the nltk library itself, but you might also want to look into the nltk.util.trigrams function.
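A toy illustration of the trigram approach, with profiles built from tiny sample sentences (real tables need far more training text per language, e.g. the NLTK data mentioned above):

```python
from collections import Counter

def trigrams(text):
    """Character-trigram frequency profile of a text."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram profiles."""
    dot = sum(profile_a[t] * profile_b[t] for t in set(profile_a) & set(profile_b))
    norm_a = sum(v * v for v in profile_a.values()) ** 0.5
    norm_b = sum(v * v for v in profile_b.values()) ** 0.5
    return dot / (norm_a * norm_b)

# Toy training profiles; real tables need far more text per language
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess_language(text):
    """Pick the language whose trigram profile is closest to the sample's."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))

print(guess_language("the dog and the fox"))  # -> en
```

Because the comparison works on character statistics rather than whole words, it degrades gracefully on mixed-language pages.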

alexis