
If I have a given text (long or short), which methods would you usually use to detect which language it is written in?

It is clear that:

  • You need a training corpus to train the models you use (e.g. neural networks, if used)

Easiest thing coming to my mind is:

  • Check characters used in the text (e.g. hiragana are only used in Japanese, Umlauts probably only in European languages, ç in French, Turkish, …)
  • Extend the check to two- or three-character sequences (n-grams) to find combinations specific to a language
  • Look words up in a dictionary to check which languages they occur in (probably only without stemming, as stemming depends on the language)
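
The character/n-gram idea above can be sketched as a rank-ordered n-gram profile comparison (the same idea as the Cavnar & Trenkle algorithm mentioned in the comments). The corpora here are tiny, hypothetical stand-ins; a real detector would train on much larger text per language.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank-ordered character n-gram profile, most frequent first."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    """Sum of rank differences; n-grams unseen in the reference get a max penalty."""
    ref_rank = {g: i for i, g in enumerate(reference)}
    penalty = len(reference)
    return sum(abs(i - ref_rank.get(g, penalty)) for i, g in enumerate(profile))

# Toy training corpora (illustrative only).
corpora = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
models = {lang: ngram_profile(text) for lang, text in corpora.items()}

def detect(text):
    """Return the language whose profile is closest to the input's profile."""
    probe = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(probe, models[lang]))
```

With realistic training data this scales to many languages, since each model is just a list of a few hundred n-grams.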

But I guess there are better ways to go. I am not searching for existing projects (those questions have already been answered), but for methods like Hidden-Markov-Models, Neural Networks, … whatever may be used for this task.

aufziehvogel
  • possible duplicate of [Return the language of a given string](http://stackoverflow.com/questions/1192768/return-the-language-of-a-given-string) (and numerous others) – Fred Foo May 18 '12 at 06:52
  • In this case, there is a (non-accepted) answer which gives a bit more detail, but the usual answers on such questions are: "You can use project A in python or project B in C++" without giving any details on what methods are used in general (see my last sentence). – aufziehvogel May 18 '12 at 14:26
  • I'm pretty sure the [Cavnar & Trenkle algorithm](http://www.nonlineardynamics.com/trenkle/papers/sdr94ps.gz) has been mentioned several times on SO. – Fred Foo May 18 '12 at 16:22

2 Answers


In the product I'm working on we use a dictionary-based approach. First, relative probabilities are calculated for all words in the training corpus and stored as a model.

Then the input text is processed word by word to see whether a particular model gives the best match (i.e. scores much better than the other models).

In some cases all models give a rather poor match.

A few interesting points:

  1. As we work with social media text, both normalized and non-normalized matches are attempted (in this context, normalization means removing diacritics from characters). Non-normalized matches have a higher weight
  2. This method works rather badly on very short phrases (1-2 words), in particular when those words exist in several languages, which is the case for a few European languages
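
The dictionary-based scoring described above can be sketched as a smoothed log-probability model per language. The corpora, smoothing floor, and function names below are illustrative assumptions, not the product's actual implementation.

```python
import math
from collections import Counter

def train(corpus, unk_prob=1e-6):
    """Relative word frequencies of a training corpus, as log-probabilities."""
    words = corpus.lower().split()
    total = len(words)
    model = {w: math.log(c / total) for w, c in Counter(words).items()}
    model["<unk>"] = math.log(unk_prob)  # floor for words unseen in training
    return model

def score(model, text):
    """Log-probability of the text under one language model."""
    unk = model["<unk>"]
    return sum(model.get(w, unk) for w in text.lower().split())

# Toy training corpora (illustrative only).
models = {
    "en": train("the cat sat on the mat the dog ran in the park"),
    "fr": train("le chat est sur le tapis le chien court dans le parc"),
}

def best_match(text):
    """Pick the highest-scoring model; a small margin over the runner-up
    signals an ambiguous (short or mixed-language) input."""
    return max(models, key=lambda lang: score(models[lang], text))
```

Comparing the winner's score against the runner-up also gives the "much better than the other models" test: if the margin is small, report the match as unreliable rather than guessing.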

Also, for better detection we are considering adding a per-character model, as you described (certain languages have certain unique characters)

Btw, we use the ICU library to split text into words. It works rather well for European and Eastern languages (currently we support Chinese)

Alex Z

Check the Cavnar and Trenkle algorithm.

aufziehvogel