Given a text (long or short), which methods do you usually use to detect the language it is written in?
It is clear that:
- You need a training corpus to train the models you use (e.g. neural networks, if used)
The easiest approaches that come to my mind are:
- Check the characters used in the text (e.g. hiragana is only used in Japanese, umlauts probably only in European languages, ç in French, Turkish, …)
- Extend the check to sequences of two or three letters (bigrams and trigrams) to find combinations that are characteristic of a language
- Look up the words in dictionaries to check which words occur in which language (probably only without stemming, as stemming depends on the language)
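
To make the first two ideas concrete, here is a minimal Python sketch of what I have in mind. It is only an illustration under assumptions: the three sample sentences in `SAMPLES` are made up and far too small for real use, the function names are hypothetical, and the add-one smoothing is just one simple way to handle unseen trigrams.

```python
import math
import unicodedata
from collections import Counter

# Toy training texts, invented for illustration; a real corpus would be far larger.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis le chien dort au soleil",
}

def has_hiragana(text):
    """Character check: True if any character's Unicode name mentions HIRAGANA."""
    return any("HIRAGANA" in unicodedata.name(ch, "") for ch in text)

def ngrams(text, n=3):
    """Character n-grams of the lower-cased text, padded with one space on each side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Count character n-grams for each language."""
    return {lang: Counter(ngrams(text, n)) for lang, text in samples.items()}

def score(text, counts, vocab_size, n=3):
    """Sum of log relative frequencies of the text's n-grams, with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in ngrams(text, n))

def detect(text, models, n=3):
    """Apply the character check first, then pick the best-scoring n-gram profile."""
    if has_hiragana(text):
        return "ja"
    vocab_size = len(set().union(*models.values()))
    return max(models, key=lambda lang: score(text, models[lang], vocab_size, n))

MODELS = train(SAMPLES)
print(detect("the dog sleeps", MODELS))
```

With such tiny samples this only works for phrases close to the training text, which is exactly why the training corpus matters.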
But I guess there are better approaches. I am not looking for existing projects (those questions have already been answered), but for methods such as hidden Markov models, neural networks, … or whatever else may be used for this task.