4

I simply want to know if a web page is in English or not. Is there any good way to do it?

The closest I've found is "Detect language from string in PHP", but it is only of some use to me.

Any suggestions?

I've a sample non-English site:

AgA
  • I'd search for the word "the". If it is English, there should be lots of "the"s. – ahmetunal Mar 20 '12 at 18:25
  • The above sample site in Russian does contain some "the's" though. – AgA Mar 21 '12 at 04:10
  • Related / duplicate question: http://linguistics.stackexchange.com/questions/1871/efficient-linguistic-algorithms-for-detecting-language-of-a-website – Mark Butler Mar 11 '13 at 06:35

4 Answers

2

It seems that the question you linked already covers nearly all of the ways to detect a language. Why can't you use one of the answers proposed there?

One more solution (though not a reliable one) is to look for meta tags with language information, such as:

<meta name="DC.language" content="en" scheme="DCTERMS.RFC3066">
<meta name="keywords" lang="en" content="some content">
<meta http-equiv="content-language" content="en">
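A minimal Python sketch of that check, using only the standard library. The names `DeclaredLanguageParser`, `declared_languages` and `declares_english` are made up for this example, and the set of meta names it looks at is an assumption based on the tags above plus the `lang` attribute on `<html>`. It returns `None` when the page declares nothing, so a caller can fall back to another check.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects language hints declared on <html> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "html" and attrs.get("lang"):
            self.hints.append(attrs["lang"])
        elif tag == "meta":
            name = attrs.get("name", "").lower()
            http_equiv = attrs.get("http-equiv", "").lower()
            if name in ("dc.language", "language") or http_equiv == "content-language":
                if attrs.get("content"):
                    self.hints.append(attrs["content"])
            elif attrs.get("lang"):
                # e.g. <meta name="keywords" lang="en" content="...">
                self.hints.append(attrs["lang"])


def declared_languages(html):
    """Return every declared language code, lowercased, in document order."""
    parser = DeclaredLanguageParser()
    parser.feed(html)
    codes = []
    for hint in parser.hints:
        # content-language may hold a comma-separated list such as "en, fr".
        for part in hint.split(","):
            part = part.strip().lower()
            if part:
                codes.append(part)
    return codes


def declares_english(html):
    """True or False when the page declares a language, None when it does not."""
    codes = declared_languages(html)
    if not codes:
        return None
    return any(code == "en" or code.startswith("en-") for code in codes)
```

A page like the sample mentioned in the comment below, which has no language declaration at all, would yield `None` here, which is why this can only be a first check.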
tonymarschall
  • I've this sample site which does not have lang word in the page: http://24-support.com/ – AgA Mar 20 '12 at 17:52
1

Some projects that might be of interest include:

Mark Butler
1

There is probably no single perfect solution. What you need is a set of checks that you execute one at a time. You probably want to start with the ones that can detect the language when the HTML page is well formed, as per tonymarschall's answer.

As a fallback check you could use a list of English stopwords. Search engines use them to filter out the most common words in a language. In your case you would count their occurrences in the text portions of the HTML page. If the count is above a certain value, you can make a fairly good guess that you are looking at English text.
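A rough Python sketch of the stopword check, assuming the text has already been extracted from the HTML. The word list is a tiny illustrative subset, and the `0.2` threshold is an arbitrary assumption that would need tuning on real pages. It uses the share of stopwords instead of a raw count so that page length does not matter.

```python
import re

# Illustrative subset only; a real stopword list would be much longer.
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "for",
    "was", "on", "with", "as", "are", "this", "be", "not", "by", "at",
    "or", "from", "have", "you",
})


def stopword_ratio(text):
    """Fraction of the words in text that are English stopwords."""
    # [^\W\d_]+ matches runs of letters in any script, so Cyrillic words count too.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text, threshold=0.2):
    """Guess that text is English when enough of its words are stopwords.

    The default threshold is an assumption, not a measured value.
    """
    return stopword_ratio(text) >= threshold
```

Because this is a ratio, a Russian page that happens to contain a few "the"s still scores low, which addresses the concern raised in the comments on the question.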

Try looking here for a list. Also, this article shows the N-gram approach, which you could use as well.
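For illustration, here is one common form of the N-gram approach in Python: build a ranked list of the most frequent character trigrams for each language, then pick the language whose ranking is closest to the document's. This is a sketch under assumptions and may differ from what the article describes. The function names, the profile size of 300 and the fixed penalty for unseen trigrams are all choices made for this example.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cap on the number of trigrams kept per profile


def trigram_profile(text, size=PROFILE_SIZE):
    """Return the most frequent character trigrams in text, most common first."""
    cleaned = " ".join(text.lower().split())  # collapse runs of whitespace
    padded = f" {cleaned} "                   # mark the start and end of the text
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(size)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles (lower means more similar).

    Trigrams missing from lang_profile get a fixed penalty of PROFILE_SIZE.
    """
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in ranks:
            total += abs(rank - ranks[gram])
        else:
            total += PROFILE_SIZE
    return total


def closest_language(text, profiles):
    """Return the key in profiles whose trigram ranking is nearest to text."""
    doc_profile = trigram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))
```

To use it, you would build `profiles` once from a reasonably large sample of text per language, for example `{"en": trigram_profile(english_sample), "ru": trigram_profile(russian_sample)}`, and then call `closest_language(page_text, profiles)`. Profiles built from only a sentence or two are too small to be reliable for languages that share an alphabet.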

yann.kmm
1

I use http://www.alchemyapi.com/ to detect languages. You take a snippet of the text and pass it to their API. It detects most languages and is quite accurate. They offer a free API that allows 1,000 requests per day, which is acceptable for moderate use. Beyond that, the price skyrockets.

You can also try the Google Translate API:

http://code.google.com/apis/language/translate/v2/getting_started.html#language_detect

Then there's this one:

http://langid.net/identify-language-from-api.html

They offer quite a few requests for free, but I don't know how accurate they are. Definitely worth a look.

Hawkee
  • @AgA I just updated my response with another that allows for up to 1,000 requests per hour for free. – Hawkee Mar 20 '12 at 18:42