Is there any library available for word counting in hieroglyphic languages (e.g. Chinese, Japanese, Korean...)?

I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?

Or are there any other solutions to achieve this?

Jin Ho
  • **Hieroglyphics**? They're not!!! :) Japanese and Chinese text is made of characters exactly as in western languages (but one character is/may be a word; you can count them if you remember that you have to count characters and not bytes). Korean has a phonetic alphabet... – Adriano Repetti Jul 30 '13 at 07:21
  • Even hieroglyphics were "a formal writing system used by the ancient Egyptians that combined logographic and alphabetic elements", actually. – KappaG3 Jul 30 '13 at 08:53

1 Answer

Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?

Hieroglyphics? No, they're not. They're logographic characters, and the difference is not so subtle. I'm sure a native speaker could explain this much better than I can.

Japanese and Chinese text is made of characters exactly as in western languages, but one character may be a word too. Moreover, they don't need spaces to separate words, so our characters/words distinction can't be made using blanks as delimiters.

What Word does is count words (assuming they're equal to characters), and you can do the same in your code by counting characters (just don't forget it's Unicode, so you can't count bytes). To count real words you need a dictionary (because you can't rely on spaces).
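To make the characters-versus-bytes point concrete, here is a minimal sketch. It assumes a console application rather than the MessageBox used further down, and the class name CharacterCounter and the sample strings are just for illustration. It also shows one caveat: string.Length counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts twice unless you count text elements with StringInfo.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CharacterCounter
{
    // Counts text elements, so a surrogate pair is counted as one character.
    public static int CountTextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // Counts the bytes the text would occupy when encoded as UTF-8.
    public static int CountUtf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        string text = "日本語";
        Console.WriteLine(text.Length);             // 3 UTF-16 code units
        Console.WriteLine(CountTextElements(text)); // 3 characters
        Console.WriteLine(CountUtf8Bytes(text));    // 9 bytes

        // U+20BB7 lies outside the Basic Multilingual Plane (one surrogate pair).
        string rare = "\U00020BB7野家";
        Console.WriteLine(rare.Length);             // 4 UTF-16 code units
        Console.WriteLine(CountTextElements(rare)); // 3 characters
        Console.WriteLine(CountUtf8Bytes(rare));    // 10 bytes
    }
}
```

For most everyday Chinese and Japanese text the two counts agree, so Length is usually good enough; StringInfo only matters for rare characters, emoji and combining sequences.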

For example these strings:

这是一个示例文本

これは、サンプルのテキストです

These will be counted as 8 characters and 8 words in Chinese, and as 15 characters and 15 words in Japanese. Actually that's not the real word count (for example, the Japanese sentence is 5 words when transliterated into romaji). Moreover, don't forget that Japanese uses more than one script (and one family of them, the kana, is phonetic).

What's the point? What will you count? Words transliterated into one of the phonetic representations (with Latin characters) that we use to represent them? Which one? The word count will be pretty different in each case, and it'll actually count our concept of words (that's why, I suppose, Word counts characters).

That said, now try this code:
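To give an idea of what "you need a dictionary" means in practice, here is a rough sketch of forward maximum matching, one simple dictionary-based technique (real segmentation libraries are far more sophisticated). The class name DictionarySegmenter, the three-entry dictionary and the maximum word length are all made up for illustration; they are not a real word list.

```csharp
using System;
using System.Collections.Generic;

public static class DictionarySegmenter
{
    // Forward maximum matching: at each position take the longest dictionary
    // entry that matches; if nothing matches, emit a single character.
    // Note: this walks UTF-16 code units, so it assumes text without surrogate pairs.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength)
    {
        var words = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            int length = Math.Min(maxWordLength, text.Length - i);
            while (length > 1 && !dictionary.Contains(text.Substring(i, length)))
            {
                length--;
            }
            words.Add(text.Substring(i, length));
            i += length;
        }
        return words;
    }

    public static void Main()
    {
        // Hypothetical toy dictionary, just enough for the sample sentence.
        var dictionary = new HashSet<string> { "一个", "示例", "文本" };
        List<string> words = Segment("这是一个示例文本", dictionary, 4);
        Console.WriteLine(string.Join(" / ", words)); // 这 / 是 / 一个 / 示例 / 文本
        Console.WriteLine(words.Count);               // 5
    }
}
```

With this toy dictionary the sample sentence comes out as 5 words instead of 8, but the result depends entirely on which dictionary you use, which is exactly the problem.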

string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());

It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it's 24. Strictly speaking, Length counts UTF-16 code units, which matches the character count here because all eight characters are in the Basic Multilingual Plane. There's no sense in counting spaces here. If you plan to count words in a transliteration you need to use an external library (it's not an easy task to do by yourself), a different one for each language you want to support (it's fairly easy to auto-detect the language, because Japanese text very often contains hiragana/katakana characters). Which one? There are a lot of them; I don't know about Chinese, but for Japanese a popular one for transliterating kanji is Kakasi.
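The auto-detection idea can be sketched like this. It is only a heuristic built on the main Unicode blocks (Hiragana/Katakana U+3040 to U+30FF, Hangul Syllables U+AC00 to U+D7A3, CJK Unified Ideographs U+4E00 to U+9FFF); the names ScriptGuesser and GuessedLanguage are invented for the example, and a Japanese text written only in kanji would be misreported as Chinese.

```csharp
using System;

public enum GuessedLanguage { Unknown, Chinese, Japanese, Korean }

public static class ScriptGuesser
{
    // Heuristic: any kana means Japanese, any Hangul syllable means Korean,
    // ideographs alone are assumed to be Chinese.
    public static GuessedLanguage Guess(string text)
    {
        bool hasKana = false, hasHangul = false, hasIdeograph = false;
        foreach (char c in text)
        {
            if (c >= '\u3040' && c <= '\u30FF') hasKana = true;           // Hiragana + Katakana
            else if (c >= '\uAC00' && c <= '\uD7A3') hasHangul = true;    // Hangul Syllables
            else if (c >= '\u4E00' && c <= '\u9FFF') hasIdeograph = true; // CJK Unified Ideographs
        }
        if (hasKana) return GuessedLanguage.Japanese;
        if (hasHangul) return GuessedLanguage.Korean;
        if (hasIdeograph) return GuessedLanguage.Chinese;
        return GuessedLanguage.Unknown;
    }

    public static void Main()
    {
        Console.WriteLine(Guess("这是一个示例文本"));               // Chinese
        Console.WriteLine(Guess("これは、サンプルのテキストです")); // Japanese
        Console.WriteLine(Guess("한국어"));                         // Korean
    }
}
```

Once the language is guessed you can pick the matching dictionary or transliteration library for it.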

Korean is a completely different story: it has an alphabet just like the Latin one, but a character (which should really be called a syllable) may be composed of several letters. Again, they don't need spaces, so you can't rely on them for word counting. It's somewhat more complicated, because here you may need a dictionary even for character counting (otherwise you'll just count syllables).
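The syllable/letter distinction can at least be illustrated without a dictionary. The sketch below relies on the arithmetic layout of the precomposed Hangul Syllables block (U+AC00 to U+D7A3, with 28 possible finals per leading consonant and vowel pair), so each syllable has 2 letters, or 3 when it has a final consonant. The class name HangulCounter is hypothetical, compound letters such as ㄲ are counted as one letter, and this is not necessarily what Word counts.

```csharp
using System;

public static class HangulCounter
{
    private const int SyllableBase = 0xAC00; // first precomposed syllable (가)
    private const int SyllableLast = 0xD7A3; // last precomposed syllable (힣)
    private const int FinalCount = 28;       // "no final" plus 27 final consonants

    public static bool IsHangulSyllable(char c)
    {
        return c >= SyllableBase && c <= SyllableLast;
    }

    // Counts precomposed Hangul syllable blocks.
    public static int CountSyllables(string text)
    {
        int syllables = 0;
        foreach (char c in text)
        {
            if (IsHangulSyllable(c)) syllables++;
        }
        return syllables;
    }

    // Counts the letters (jamo) inside the syllable blocks:
    // leading consonant + vowel, plus one more if there is a final consonant.
    public static int CountLetters(string text)
    {
        int letters = 0;
        foreach (char c in text)
        {
            if (!IsHangulSyllable(c)) continue;
            int finalIndex = (c - SyllableBase) % FinalCount;
            letters += finalIndex == 0 ? 2 : 3;
        }
        return letters;
    }

    public static void Main()
    {
        Console.WriteLine(CountSyllables("한국어")); // 3
        Console.WriteLine(CountLetters("한국어"));   // 8 (3 + 3 + 2)
    }
}
```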

Adriano Repetti
  • Korean does have spaces between characters, though. I noticed that MS Word counts the words in a text by counting the characters. It works the same way for Chinese and Japanese. The interesting thing I realized is that the word count function of Google Docs works incorrectly for these languages :D – Jin Ho Jul 30 '13 at 11:12
  • @JinHo They (Koreans) have spaces (same as Chinese and Japanese) but they're not strictly required. It's pretty common to see a long text in any of those languages without any spaces. Well...yes... _our_ software was not (and somehow still is not) so good for far east languages (especially when they are too different from our _concepts_). More often than not, the globalization/localization process is not just Unicode support (but we're slow to understand it). – Adriano Repetti Jul 30 '13 at 11:28
  • @JinHo Oh, you're from Vietnam!!! Great country!!! Even if, more than once, I feared to die crossing the street in Ho Chi Minh! :) – Adriano Repetti Jul 30 '13 at 11:34
  • The streets in HCM are crowded but safe. People here are very good at driving, you can trust them :D – Jin Ho Aug 02 '13 at 09:11