
I am creating a crawler that downloads web pages from a website and stores their content in a database.
I want to store only documents that are written in English.
I can't work out how to determine which language a given web page is in, so that I can decide whether to store it in the database.

gunr2171
  • Possible duplicate of [How to detect the language of a string?](https://stackoverflow.com/questions/1192768/how-to-detect-the-language-of-a-string) – PaulF Mar 18 '19 at 16:52
  • You could determine the frequency of the most common words: the, be, to ... – Stefan Mar 18 '19 at 16:54
  • Possible duplicate of [How to determine the language of a website](https://stackoverflow.com/questions/35209243/how-to-determine-the-language-of-a-website) – gunr2171 Mar 18 '19 at 17:02
  • What if it's half English and half French? What if the English portion is ads and the content is German? – Steve Mar 18 '19 at 17:03
  • There are, sometimes, meta tags that identify the language of the page. These may or may not be helpful, and I don't know that there is a standard meta tag to look for. – Mark Sholund Mar 18 '19 at 18:10

2 Answers

You should use a language recognition service. There are several APIs for this: you send the text and the API returns the language it detected.

You could also build your own detector with machine learning, training it on a set of examples of English text.

I would recommend searching Google for "language recognition API" or similar to get a clearer idea of what is available.
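If you want a rough local check before calling an external service, one simple option is to count how often very common English words appear in the text. The following is only a minimal sketch in Python, not a trained model: the word list, the function names, and the 0.2 threshold are all assumptions chosen for illustration, and short or mixed-language pages will easily fool it.

```python
import re

# Hypothetical, deliberately small list of very common English words.
ENGLISH_STOP_WORDS = frozenset({
    "the", "be", "to", "of", "and", "a", "in", "that", "have", "it",
    "is", "for", "not", "on", "with", "as", "you", "at", "this", "but",
})


def english_stop_word_ratio(text: str) -> float:
    """Return the fraction of words in `text` found in ENGLISH_STOP_WORDS."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ENGLISH_STOP_WORDS)
    return hits / len(words)


def looks_english(text: str, threshold: float = 0.2) -> bool:
    """Guess whether `text` is English; the threshold is an assumed value."""
    return english_stop_word_ratio(text) >= threshold
```

You would call `looks_english(page_text)` on the visible text of the page (with markup stripped) and tune the threshold against pages whose language you already know.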

Brank Victoria

I suspect there is no single way of doing this. Some HTML pages declare their language, but many, if not most, do not. You will have to come up with a heuristic that combines several signals and decide based on the result.

Maybe some weighting:

  • HTML declaration = 0.75
  • 90% of innerText is English = 0.50
  • and so on (I can't think of another test)

Then decide whether the total reaches a value at which you can say "this is definitely English", and off you go.
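As a rough illustration of the weighting idea, here is a minimal Python sketch. The two weights come from the list above; everything else is an assumption: the function and class names are made up, the HTML declaration is taken to be the `lang` attribute on the `<html>` element, the fraction of English text is assumed to be computed elsewhere and passed in, and the acceptance threshold of 1.0 (both signals required) is arbitrary.

```python
from html.parser import HTMLParser
from typing import Optional

# Weights from the list above; the threshold is an assumed value.
WEIGHT_HTML_DECLARATION = 0.75
WEIGHT_TEXT_MOSTLY_ENGLISH = 0.50
ACCEPT_THRESHOLD = 1.0  # assumption: require both signals


class _LangAttributeParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html: str) -> Optional[str]:
    """Return the declared page language (e.g. "en-US"), or None."""
    parser = _LangAttributeParser()
    parser.feed(html)
    return parser.lang


def english_score(html: str, english_text_ratio: float) -> float:
    """Sum the weights of the signals that point to English."""
    score = 0.0
    lang = declared_language(html)
    if lang and lang.lower().split("-")[0] == "en":
        score += WEIGHT_HTML_DECLARATION
    if english_text_ratio >= 0.9:  # "90% of innerText is English"
        score += WEIGHT_TEXT_MOSTLY_ENGLISH
    return score


def is_english(html: str, english_text_ratio: float,
               threshold: float = ACCEPT_THRESHOLD) -> bool:
    """Accept the page when the combined score reaches the threshold."""
    return english_score(html, english_text_ratio) >= threshold
```

Lowering the threshold to 0.5 would accept a page on either signal alone, which may suit crawls where most pages carry no declaration.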

Neil