
(This is NOT a duplicate of "How to detect the language of a string?")

I need to be able to determine the alphabet of a given string (a single word) from its language- or alphabet-specific characters. For example, if the string contains:

  • 'Ü' it should be recognized as German,
  • 'ش' as Arabic,
  • 'Φ' as Greek, and so on.

I'm looking for a list of alphabet-specific characters, grouped by language/alphabet. Since the input is a single non-dictionary word, the Google Translate API and other dictionary-based solutions won't work.

(Although the question isn't programming language specific, the actual code is written in C#)

SimSimY
  • 3
    While I understand what you're asking, what about using existing language detection libraries, which look not only at the included characters but also at other characteristics? If a sentence in any European language doesn't contain any such specific characters, things get terribly vague. Is *"Ich gehe heute zur Arbeit"* German, English, Dutch? – deceze Nov 08 '12 at 12:21
  • 1
    @Steve Indeed. Basically: language !== alphabet. There's hardly a 1:1 correlation. Hopefully the OP understands that. – deceze Nov 08 '12 at 14:32
  • @deceze: sorry to delete under you, I realised that my example wasn't very relevant since the questioner specifies that the input is a single word. I absolutely agree with you, though. – Steve Jessop Nov 08 '12 at 14:37
  • @deceze: Not only is the input a single word, but as I mentioned, I'm specifically interested in the special characters. In your example, it would be fine if every word were recognized as English (as there are no special characters). As for your second reply: yes, I'm much more interested in the alphabet than in the actual language. – SimSimY Nov 11 '12 at 10:57

1 Answer


You could start with the Unicode name of each character. For example (in Python):

>>> import unicodedata
>>> unicodedata.name(u'Φ')
'GREEK CAPITAL LETTER PHI'
>>> unicodedata.name(u'ش')
'ARABIC LETTER SHEEN'
>>> unicodedata.name(u'Ü')
'LATIN CAPITAL LETTER U WITH DIAERESIS'
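
The same lookup can be wrapped in a small helper. This is only a minimal sketch: it assumes the first word of the Unicode name is a usable stand-in for the alphabet, which holds for the three names above but not for every character (digits and punctuation, for instance, carry no script word in their names). The names `script_of` and `scripts_in` are made up for this example.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    """Return the first word of the character's Unicode name, or None.

    Heuristic only: the first word happens to be the script for names
    such as 'GREEK CAPITAL LETTER PHI', but that is an assumption.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Raised for characters that have no name (e.g. control characters).
        return None
    return name.split()[0]

def scripts_in(word):
    """Count the guessed script of every alphabetic character in word."""
    counts = Counter()
    for ch in word:
        if ch.isalpha():  # skip digits, punctuation, whitespace
            script = script_of(ch)
            if script is not None:
                counts[script] += 1
    return counts

print(script_of("Φ"))     # GREEK
print(script_of("ش"))     # ARABIC
print(script_of("Ü"))     # LATIN
print(scripts_in("Φως"))  # Counter({'GREEK': 3})
```

Counting per word rather than stopping at the first character makes it easy to notice words that mix alphabets.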

You might have to special-case the Latin characters, since Unicode doesn't assign them to particular language-specific alphabets. Most of them appear in several languages that use Latin-based alphabets, but if you're somehow confident that your data will contain Ü only if it is German, then you can identify that character as German for your purposes. There are only a few dozen Latin characters to worry about.
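
One way to sketch that special-casing is a small override table that is consulted before falling back to the Unicode name. Everything in `LATIN_HINTS` below is a hypothetical, application-specific assumption (Ü, for example, also occurs in Turkish and other languages), and `classify` is a name invented for this example, so treat it as an illustration rather than a reference list.

```python
import unicodedata

# Hypothetical overrides: only valid if you are confident about your data.
LATIN_HINTS = {
    "Ü": "German", "ü": "German", "ß": "German",
    "Ñ": "Spanish", "ñ": "Spanish",
}

def classify(word):
    """Return a language hint if one applies, else the guessed alphabet."""
    # Compose 'U' + combining diaeresis into a single 'Ü' before lookup.
    word = unicodedata.normalize("NFC", word)

    # 1. Application-specific overrides win.
    for ch in word:
        if ch in LATIN_HINTS:
            return LATIN_HINTS[ch]

    # 2. Otherwise fall back to the first word of the Unicode name
    #    of the first alphabetic character.
    for ch in word:
        if ch.isalpha():
            try:
                return unicodedata.name(ch).split()[0].title()
            except ValueError:
                continue  # unnamed character, keep looking
    return None

print(classify("Über"))   # German
print(classify("Φως"))    # Greek
print(classify("hello"))  # Latin
```

The NFC normalization step matters because the same visible letter can arrive either as one code point or as a base letter plus a combining mark.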

Similarly, loads of languages use the Unicode CYRILLIC letters, so in most cases their presence doesn't tell you the language. Some are described by Unicode as belonging to particular languages: CYRILLIC SMALL LETTER YI has the note "Ukrainian" in http://www.unicode.org/charts/PDF/U0400.pdf. I don't know whether those notes are exhaustive, i.e. whether Ukrainian is the only language that uses that character. And I'm certain that there are plenty of Ukrainian words that don't have that character in them. Fundamentally, you cannot distinguish Ukrainian words from Russian words solely by the presence or absence of Ukrainian-specific letters.
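
To make the one-way nature of such a hint concrete, here is a short sketch. It assumes, purely for illustration, that the letter YI (in either case) is treated as a Ukrainian marker; `UKRAINIAN_HINTS` and `has_ukrainian_hint` are made-up names, and the check is only as reliable as that assumption.

```python
import unicodedata

# Assumption for illustration: treat YI (small and capital) as a marker.
UKRAINIAN_HINTS = {"\u0457", "\u0407"}  # ї, Ї

def has_ukrainian_hint(word):
    """True if the word contains a marker letter; False proves nothing."""
    return any(ch in UKRAINIAN_HINTS for ch in word)

print(unicodedata.name("\u0457"))     # CYRILLIC SMALL LETTER YI
print(has_ukrainian_hint("Україна"))  # True
print(has_ukrainian_hint("молоко"))   # False
```

The last call returns False even though "молоко" (milk) is a perfectly good Ukrainian word; it is also spelled the same way in Russian, so the absence of the marker says nothing either way.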

I expect the same is true of other alphabets in Unicode. If you're really lucky you might find a Unicode database that includes any such notes on each character, so you can mine it for mention of particular languages.
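
If such a database turns out to be a plain-text file in the style of the Unicode `NamesList.txt` (a code point line, followed by tab-indented lines where `*` marks a note), mining it could look roughly like the sketch below. The layout is an assumption to verify against the actual file, `SAMPLE` is a hand-written excerpt in that style rather than a verbatim copy, and `notes_by_char` is a hypothetical helper.

```python
import re

# Hand-written excerpt in the assumed layout; not copied from the real file.
SAMPLE = (
    "0430\tCYRILLIC SMALL LETTER A\n"
    "0457\tCYRILLIC SMALL LETTER YI\n"
    "\t* Ukrainian\n"
)

ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")

def notes_by_char(text):
    """Map each character to the list of '*' notes that follow its entry."""
    notes = {}
    current = None
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            # A new character entry starts here.
            current = chr(int(match.group(1), 16))
        elif not line.startswith("\t"):
            # Headers and other non-entry lines end the current entry.
            current = None
        elif current is not None and line.startswith("\t* "):
            notes.setdefault(current, []).append(line[3:].strip())
    return notes

print(notes_by_char(SAMPLE))  # {'ї': ['Ukrainian']}
```

From there, searching the collected notes for language names is a simple loop, though whether the notes are complete is a separate question.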

Steve Jessop
  • As @deceze pointed out, my question wasn't accurate: I'm interested in the alphabet used rather than the actual language. I will probably need to implement something like this http://stackoverflow.com/questions/2087682/finding-out-unicode-character-name-in-net after I process the file and switch the Latin-1 characters with the right language. – SimSimY Nov 11 '12 at 11:10