5

Is there a way in .NET to determine the script family of an input string? For example, I have the following:

咖啡, กาแฟ, コーヒー, قهوة

("coffee" in Chinese, Thai, Japanese and Arabic, respectively)

Is there a way to determine what script these are in, as a general script family (for example "Hans/Hant", "Thai", "Jpan", "Arab", which are IANA / ISO 15924 groupings)?

Todd Main
  • Similar question -> http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file – Chris Baxter Jul 10 '11 at 01:53
  • Take this example string "咖啡 café". They are not necessarily different codepages, but they are different scripts. The example you gave focuses on codepages - I'll edit that part out of the question. – Todd Main Jul 10 '11 at 02:10
  • I am almost certain this is not possible in pure .NET. Are you willing to use external DLLs? – Paweł Dyda Jul 12 '11 at 08:30
  • @Pawel Dyda: unfortunately, no, I can't use external DLLs – Todd Main Jul 13 '11 at 19:53
  • @Otaku: As I wrote, .NET does not seem to have this kind of information... I can't help it, my best guess would be to use ICU4C or some native calls (I believe Char Map obtains this information from somewhere). – Paweł Dyda Jul 13 '11 at 20:09
  • Could you use Google Translate API to determine the language? – Huske Jul 18 '11 at 12:39
  • Or IE's language detection API IMultiLanguage2 – Sheng Jiang 蒋晟 Jul 18 '11 at 17:30
  • @Sheng Jiang 蒋晟: that's a good lead. I can't find a list of the SCRIPT_IDs however to see if I can get the script name. – Todd Main Jul 18 '11 at 19:49

1 Answer

3

I had a similar problem (detecting the alphabet/script in order to count words), and I ended up checking every character to see which Unicode block it belongs to, and so how to treat it. Basically, the Chinese, Japanese, Arabic and Thai "alphabets" are defined in separate Unicode blocks.
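As a rough illustration of that per-character check, here is a minimal C# sketch that uses only the base class library. The `ScriptGuesser` class and its method names are hypothetical, the labels are ISO 15924 codes assigned by me, and the ranges cover only a handful of Unicode blocks. Blocks only approximate scripts (for example, the Katakana block contains the shared prolonged sound mark "ー"), and Han characters alone cannot tell you whether the text is Chinese or Japanese.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: guesses scripts from Unicode block ranges.
public static class ScriptGuesser
{
    // Maps one UTF-16 code unit to an ISO 15924-style label.
    // Returns null for anything outside the few blocks listed here
    // (spaces, digits, punctuation, surrogates, other scripts).
    public static string ScriptOf(char c)
    {
        if (c >= '\u0E00' && c <= '\u0E7F') return "Thai"; // Thai block
        if (c >= '\u0600' && c <= '\u06FF') return "Arab"; // Arabic block
        if (c >= '\u3040' && c <= '\u309F') return "Hira"; // Hiragana block
        if (c >= '\u30A0' && c <= '\u30FF') return "Kana"; // Katakana block
        if (c >= '\u4E00' && c <= '\u9FFF') return "Hani"; // CJK Unified Ideographs
        bool asciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        // Latin-1 letters through Latin Extended-B, minus the two math signs.
        bool extendedLatin = c >= '\u00C0' && c <= '\u024F'
                             && c != '\u00D7' && c != '\u00F7';
        if (asciiLetter || extendedLatin) return "Latn";
        return null;
    }

    // Returns the distinct labels found in the text, in order of first appearance.
    public static List<string> ScriptsIn(string text)
    {
        var found = new List<string>();
        foreach (char c in text)
        {
            string script = ScriptOf(c);
            if (script != null && !found.Contains(script))
            {
                found.Add(script);
            }
        }
        return found;
    }
}
```

With this sketch, `ScriptGuesser.ScriptsIn("咖啡 café")` yields `Hani` and `Latn`, and `ScriptGuesser.ScriptsIn("コーヒー")` yields `Kana`. Seeing `Hira` or `Kana` alongside `Hani` is a reasonable hint for "Jpan", but that inference is a heuristic rather than something the block ranges guarantee.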

Dario Solera
  • 3
    How about those CJK characters that mean different things in different languages? 大丈夫 means "all right" in Japanese but "great man" in Chinese. – Sheng Jiang 蒋晟 Jul 18 '11 at 17:18
  • As far as I know (though I might be wrong, as I didn't need this kind of accuracy), even if the characters are equal in Chinese and Japanese, they're treated as different in the Unicode standard, so you can still recognize the family. – Dario Solera Jul 19 '11 at 06:20
  • 1
    No, fonts for different languages would render them differently, but as far as encoding is concerned their binary representation is the same. The word, without other context, would be valid in both languages. – Sheng Jiang 蒋晟 Jul 19 '11 at 16:47
  • @Sheng: the OP isn't asking for *language*, he's asking for *script family*, so this answer might work. – egrunin Jul 26 '11 at 20:30