I am trying to build a corpus for a particular language. When I fetch a web page, how can I determine what language it is written in? Chrome can do this, but what is the underlying principle?
I can come up with some ad-hoc methods, such as an educated guess based on the character set, the server's IP address, HTML tags, etc. But is there a more formal method?
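
To make the ad-hoc approach concrete, here is a minimal sketch of the kind of guess I mean, using only the Python standard library. The names `LangAttrParser`, `declared_language`, and `dominant_script` are my own, made up for illustration. It only reads the `lang` attribute on the `<html>` tag (when the page declares one) and counts which Unicode script most letters belong to. I assume the visible text has already been extracted from the markup before calling `dominant_script`.

```python
import unicodedata
from collections import Counter
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lowercased.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(html):
    """Return the language the page claims for itself, or None."""
    parser = LangAttrParser()
    parser.feed(html)
    return parser.lang


def dominant_script(text):
    """Return the most common script among the letters in text, or None.

    The script is approximated by the first word of each character's
    Unicode name, e.g. 'LATIN', 'CYRILLIC', 'CJK'.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    page = '<html lang="fr"><body>Bonjour</body></html>'
    print(declared_language(page))
    print(dominant_script("Привет, мир"))
```

This shows why I call it ad hoc: the `lang` attribute is often missing or wrong, and the script only narrows things down (it cannot tell English from French, since both are Latin script).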