4

If I have a string in Java, how can I determine which language it belongs to? Does the Unicode specification allow us to do this?

Rajesh
    Wow, do you mean you want to determine which language `.` belongs to? Good luck :-) I hope you'll settle for an ordered list of "possible languages" – Riduidel Apr 07 '11 at 14:07

1 Answer

6

There is no metadata in a Unicode string that specifies what language the string is in, or whether the string is even a word or phrase.

Based on the characters contained in the string, you may be able to guess which language is being used. For example, the Unicode range U+30A0–U+30FF holds the Japanese Katakana characters, so if most of your string consists of characters within that range, you could make an educated guess that it's Japanese. This is not at all reliable, though. For instance, what if it's just random Katakana characters?
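As a rough illustration of that kind of guess, here is a minimal sketch that counts which Unicode script each letter belongs to, using the JDK's `Character.UnicodeScript` (Java 7+; `Map.merge` needs Java 8). The class and method names (`ScriptGuesser`, `countScripts`, `dominantScript`) are made up for this example, and the result is a script, not a language.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration only: guesses the dominant script of a string.
final class ScriptGuesser {

    // Count how many letters of the text fall into each Unicode script.
    static Map<UnicodeScript, Integer> countScripts(String text) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // handles supplementary characters
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;                      // skip digits, punctuation, whitespace
            }
            counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
        }
        return counts;
    }

    // Return the script with the most letters, or UNKNOWN if there are no letters.
    static UnicodeScript dominantScript(String text) {
        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : countScripts(text).entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "\u30AB\u30BF\u30AB\u30CA" is the Katakana string "katakana"
        System.out.println(dominantScript("\u30AB\u30BF\u30AB\u30CA"));
        System.out.println(dominantScript("hello"));
    }
}
```

Note that the second call can only tell you the text is in the Latin script, which is shared by English, French, German and many other languages, so the script narrows the candidates at best.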

For reliable language detection, I would abandon the idea of using Unicode ranges as the basis and focus on language recognition algorithms instead.

Jeff