
I am trying to get a corpus for a certain language. But when I fetch a webpage, how can I determine what language it is in? Chrome can do it, but what's the principle?

I can come up with some ad-hoc methods, like an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?

– Kun Wu
  • Possible duplicate, or at least good answers to the same question: http://stackoverflow.com/questions/1464362/detect-language-of-text Basically there are lots of tools out there to do this for you, just pick a library that works well for your particular needs and use it. One question that may be relevant, what language are you seeking a corpus for? Some tools are better at certain languages or families of languages than others. – Thaeli Nov 08 '11 at 03:39

3 Answers


If you are just interested in collecting corpora of different languages, you can look at country-specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.

Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
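
To make the Wikipedia idea concrete, here is a minimal sketch of such a scraper. It uses Wikipedia's public REST API; the requests dependency, the User-Agent string, and the lack of retry logic are illustrative choices, not part of the answer itself:

    import requests

    def random_article_text(lang="es"):
        # Each language edition lives on its own subdomain; the REST API
        # can return a random article summary from that edition.
        url = f"https://{lang}.wikipedia.org/api/rest_v1/page/random/summary"
        resp = requests.get(url, headers={"User-Agent": "corpus-example/0.1"})
        resp.raise_for_status()
        data = resp.json()
        return data["title"], data["extract"]

    # Collect a few Spanish snippets for a corpus
    for _ in range(3):
        title, text = random_article_text("es")
        print(title, "->", text[:60])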

– mayhewsw

The model that determines a webpage's language in Chrome is called the Compact Language Detector v3 (CLD3), and its C++ code is open source (sort of; the released model isn't reproducible). There are also official Python bindings for it:

pip install gcld3
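
If I remember the bindings correctly, usage looks roughly like this (the sample string and byte bounds are illustrative):

    import gcld3

    # Input outside these byte bounds is truncated before classification
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

    result = detector.FindLanguage(text="Este texto está escrito en español")
    print(result.language)     # BCP-47-style code, e.g. 'es'
    print(result.is_reliable)  # whether the prediction is trustworthy
    print(result.probability)  # confidence of the top prediction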
– Boris Verkhovskiy

I suppose the common method is looking at things like letter frequencies, common letter sequences and words, and character sets (as you describe); there are lots of different ways. An easy one would be to get a bunch of dictionary files for various languages, test which one gets the most hits from the page, and then offer, say, the next three as alternatives.
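
A minimal sketch of that dictionary approach, assuming you have one-word-per-line dictionary files on hand (the file names here are hypothetical):

    import re
    from collections import Counter

    def guess_language(page_text, wordlists):
        # Count how many tokens from the page appear in each language's dictionary
        tokens = re.findall(r"\w+", page_text.lower())
        hits = Counter()
        for lang, words in wordlists.items():
            hits[lang] = sum(1 for token in tokens if token in words)
        # Best guess first; the runners-up can be offered as alternatives
        return hits.most_common()

    # Hypothetical word-per-line dictionary files, one per language
    wordlists = {
        lang: set(open(f"words_{lang}.txt", encoding="utf-8").read().split())
        for lang in ("en", "es", "de")
    }
    ranking = guess_language("el perro corre por el parque", wordlists)
    print(ranking)  # e.g. [('es', 5), ('en', 0), ('de', 0)]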

– Ry-