
I am trying to get a corpus for a certain language. But when I fetch a webpage, how can I determine what language it is in? Chrome can do it, but what's the principle?

I can come up with some ad-hoc methods, like an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?

– Kun Wu
  • Possible duplicate, or at least good answers to the same question: http://stackoverflow.com/questions/1464362/detect-language-of-text Basically there are lots of tools out there to do this for you, just pick a library that works well for your particular needs and use it. One question that may be relevant, what language are you seeking a corpus for? Some tools are better at certain languages or families of languages than others. – Thaeli Nov 08 '11 at 03:39

3 Answers


If you are just interested in collecting corpora of different languages, you can look at country-specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.

Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
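
To make the Wikipedia idea concrete, here is a minimal sketch of such a scraper. It uses Wikipedia's public REST API; the requests dependency, the User-Agent string, and the lack of retry logic are illustrative choices, not part of the answer itself:

    import requests

    def random_article_text(lang="es"):
        # Each language edition lives on its own subdomain; the REST API
        # can return a random article summary from that edition.
        url = f"https://{lang}.wikipedia.org/api/rest_v1/page/random/summary"
        resp = requests.get(url, headers={"User-Agent": "corpus-example/0.1"})
        resp.raise_for_status()
        data = resp.json()
        return data["title"], data["extract"]

    # Collect a few Spanish snippets for a corpus
    for _ in range(3):
        title, text = random_article_text("es")
        print(title, "->", text[:60])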

– mayhewsw

The model that determines a webpage's language in Chrome is called the Compact Language Detector v3 (CLD3), and its C++ code is open source (sort of; the released model isn't reproducible). There are also official Python bindings for it:

pip install gcld3
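
If I remember the bindings correctly, usage looks roughly like this (the sample string and byte bounds are illustrative):

    import gcld3

    # Input outside these byte bounds is truncated before classification
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

    result = detector.FindLanguage(text="Este texto está escrito en español")
    print(result.language)     # BCP-47-style code, e.g. 'es'
    print(result.is_reliable)  # whether the prediction is trustworthy
    print(result.probability)  # confidence of the top prediction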
– Boris Verkhovskiy

I suppose the common method is looking at things like letter frequencies, common letter sequences and words, and character sets (as you describe); there are lots of different ways. An easy one would be to get a bunch of dictionary files for various languages, test which one gets the most hits from the page, and then offer, say, the next three as alternatives.
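
A minimal sketch of that dictionary approach, assuming you have one-word-per-line dictionary files on hand (the file names here are hypothetical):

    import re
    from collections import Counter

    def guess_language(page_text, wordlists):
        # Count how many tokens from the page appear in each language's dictionary
        tokens = re.findall(r"\w+", page_text.lower())
        hits = Counter()
        for lang, words in wordlists.items():
            hits[lang] = sum(1 for token in tokens if token in words)
        # Best guess first; the runners-up can be offered as alternatives
        return hits.most_common()

    # Hypothetical word-per-line dictionary files, one per language
    wordlists = {
        lang: set(open(f"words_{lang}.txt", encoding="utf-8").read().split())
        for lang in ("en", "es", "de")
    }
    ranking = guess_language("el perro corre por el parque", wordlists)
    print(ranking)  # e.g. [('es', 5), ('en', 0), ('de', 0)]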

– Ry-