I simply want to know if a web page is in English or not. Is there any good way to do it?
The closest I've found is "Detect language from string in PHP", but it is only of some use to me.
Any suggestions?
I've a sample non-English site:
It seems that the question you linked already covers nearly all of the ways to detect a language. Why can't you use one of the answers proposed there?
One more solution (though not a reliable one) is to look for meta tags with language information, such as:
<meta name="DC.language" content="en" scheme="DCTERMS.RFC3066">
<meta name="keywords" lang="en" content="some content">
<meta http-equiv="content-language" content="en">
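A minimal Python sketch of that meta tag check, using only the standard library. The names LanguageHintParser, declared_languages and declares_english are made up for this illustration, and the sketch only looks at the three tag patterns shown above. Pages can omit or misdeclare these tags, so treat the result as a hint rather than a verdict.

```python
from html.parser import HTMLParser


class LanguageHintParser(HTMLParser):
    """Collects language declarations found in <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta lang>), so normalise them.
        attrs = {name.lower(): (value or "") for name, value in attrs}
        name = attrs.get("name", "").lower()
        http_equiv = attrs.get("http-equiv", "").lower()
        if name == "dc.language" or http_equiv == "content-language":
            # The language code lives in the content attribute.
            self.hints.append(attrs.get("content", ""))
        elif attrs.get("lang"):
            # Any other meta tag carrying a lang attribute, e.g. keywords.
            self.hints.append(attrs["lang"])


def declared_languages(html):
    """Return all language codes declared in meta tags, lowercased."""
    parser = LanguageHintParser()
    parser.feed(html)
    parser.close()
    codes = []
    for hint in parser.hints:
        # content-language may list several codes separated by commas.
        for code in hint.split(","):
            code = code.strip().lower()
            if code:
                codes.append(code)
    return codes


def declares_english(html):
    """True or False if a language is declared, None if nothing is declared."""
    codes = declared_languages(html)
    if not codes:
        return None
    return any(code == "en" or code.startswith("en-") for code in codes)
```

Returning None when nothing is declared lets the caller move on to a content-based check instead of guessing.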
Some projects that might be of interest include:
There is probably no single perfect solution; what you need is a set of checks that you execute one at a time. You probably want to start with the ones that can detect the language when the HTML page is well formed, as per tonymarschall's answer.
As a fallback check you could use a list of English stopwords, the most common words in a language, which search engines filter out. In your case you'd count their occurrences in the text portions of the HTML page. If their share of all words is above a certain threshold, you can make a fairly good guess that you are looking at English text.
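The stopword check could look roughly like this in Python. Everything here is an assumption made for illustration: the stopword set is a tiny sample rather than a real list, the 0.2 threshold is a guess that you would need to tune on real pages, and the names TextExtractor, stopword_ratio and looks_english are invented for the sketch.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative sample; a real stopword list is much longer.
ENGLISH_STOPWORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "is", "it", "that", "for",
    "you", "was", "on", "are", "with", "as", "be", "at", "this",
    "have", "from", "or", "by", "not", "but",
})


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def stopword_ratio(html):
    """Fraction of the words in the page text that are English stopwords."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    # Runs of letters only (Unicode aware), so "für" stays one word.
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(html, threshold=0.2):
    """Guess whether the page is English; the threshold is an assumed value."""
    return stopword_ratio(html) >= threshold
```

Short pages give noisy ratios, so it is probably worth requiring a minimum number of words before trusting the result.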
Try looking here for a list. This article also shows the N-gram approach, which you could use as well.
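One common variant of the N-gram idea ranks character trigrams by frequency and compares the rank order of a document against per-language profiles. The Python sketch below follows that variant, which is not necessarily the one the article uses. The sample texts are far too short for real use, and the function names and the top=300 cutoff are assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    # Keep letters only, collapse everything else to single spaces.
    letters = "".join(c if c.isalpha() else " " for c in text.lower())
    padded = " " + " ".join(letters.split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place_distance(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language cost max_penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def closest_language(text, samples):
    """Return the key of the sample whose profile is nearest to the text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}
    return min(profiles, key=lambda lang: out_of_place_distance(doc, profiles[lang]))


# Toy training samples, for illustration only; use large corpora in practice.
SAMPLES = {
    "en": ("the quick brown fox jumps over the lazy dog and then it runs to "
           "the other side of the river where the water is cold and the "
           "grass is green"),
    "de": ("der schnelle braune fuchs springt über den faulen hund und dann "
           "läuft er auf die andere seite des flusses wo das wasser kalt "
           "und das gras grün ist"),
}

guess = closest_language(
    "the dog runs over the green grass and the water of the river is cold",
    SAMPLES,
)
```

To answer "English or not", you would build profiles for English and for the other languages you expect to see, then check whether English comes out closest.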
I use http://www.alchemyapi.com/ to detect languages. You take a snippet of the text and pass it to their API. It detects most languages and is quite accurate. They offer a free API that allows for 1,000 requests per day, which is acceptable for moderate use. Beyond that, the price skyrockets.
You can also try the Google translate API:
http://code.google.com/apis/language/translate/v2/getting_started.html#language_detect
Then there's this one:
http://langid.net/identify-language-from-api.html
They offer quite a few requests for free, but I don't know how accurate they are. Definitely worth a look.