
I'm using the code here to split text into individual words, and it's working great for all languages I have tried, except for Japanese and Chinese.

Is there a way that code can be tweaked to properly tokenize Japanese and Chinese as well? The documentation says those languages are supported, but it does not seem to be breaking words in the proper places. For example, when it tokenizes "新しい" it breaks it into two words "新し" and "い" when it should be one (I don't speak Japanese, so I don't know if that is actually correct, but the sample I have says that those should all be one word). Other times it skips over words.

I did try creating Chinese and Japanese locales while using kCFStringTokenizerUnitWordBoundary. The results improved, but are still not good enough for what I'm doing (adding hyperlinks to vocabulary words).
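
To be concrete, this is roughly the setup I'm describing (a minimal Swift sketch rather than my actual code; the hard-coded "ja" locale and the sample string are just for illustration):

```swift
import Foundation

let text = "新しい"  // sample from above; ideally this comes back as a single token
let cfText = text as CFString
let range = CFRange(location: 0, length: CFStringGetLength(cfText))

// Explicit Japanese locale plus the word-boundary tokenization unit.
let locale = CFLocaleCreate(kCFAllocatorDefault, "ja" as CFString)
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                        cfText,
                                        range,
                                        CFOptionFlags(kCFStringTokenizerUnitWordBoundary),
                                        locale)

// Walk the tokens and print each one.
while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
    let r = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let token = (text as NSString).substring(with: NSRange(location: r.location, length: r.length))
    print(token)
}
```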

I am aware of some other tokenizers that are available, but would rather avoid them if I can just stick with Core Foundation.

[UPDATE] We ended up using MeCab with a specific user dictionary for Japanese for some time, and have now moved over to just doing all of this on the server side. It may not be perfect there, but we have consistent results across all platforms.

– arsenius

2 Answers


Also have a look at NSLinguisticTagger. But by itself it won't give you much more.
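
For instance, something along these lines enumerates word tokens (a rough sketch, not drop-in code; the sample string is made up):

```swift
import Foundation

let text = "新しい本を読む"  // made-up sample
let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
tagger.string = text

// Enumerate word tokens, skipping whitespace and punctuation.
let range = NSRange(location: 0, length: (text as NSString).length)
tagger.enumerateTags(in: range,
                     scheme: .tokenType,
                     options: [.omitWhitespace, .omitPunctuation]) { _, tokenRange, _, _ in
    print((text as NSString).substring(with: tokenRange))
}
```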

Truth be told, these two languages (and some others) are really hard to programmatically tokenize accurately.

You should also see the WWDC videos on LSM (Latent Semantic Mapping). They cover the topic of stemming and lemmas, the art and science of more accurately determining how to tokenize meaningfully.

What you want to do is hard. Finding word boundaries alone does not give you enough context to convey accurate meaning. It requires looking at the surrounding context and also identifying idioms and phrases that should not be broken up word by word (not to mention grammatical forms).

After that, look again at the available libraries, then get a book on Python's NLTK to learn what you need to know about NLP and decide how far you really want to pursue this.

Larger bodies of text inherently yield better results. There's no accounting for typos and bad grammar. Much of the context needed to drive the logic in the analysis is implicit, not directly written as words. You get to build rules and train the thing.

Japanese is a particularly tough one, and many libraries developed outside of Japan don't come close. You need some knowledge of the language to know if the analysis is working. Even native Japanese speakers can have a hard time doing the natural analysis without the proper context. There are common scenarios where the language presents two different but equally valid word boundaries.

To give an analogy, it's like doing lots of lookahead and lookbehind in regular expressions.

– uchuugaka

If you know that you're parsing a particular language, you should create your CFStringTokenizer with the correct CFLocale (or at the very least, the guess from CFStringTokenizerCopyBestStringLanguage) and use kCFStringTokenizerUnitWordBoundary.
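
A rough Swift sketch of that setup (the sample string is arbitrary, and error handling is omitted):

```swift
import Foundation

let sample = "新しい本を読む" as CFString
let range = CFRange(location: 0, length: CFStringGetLength(sample))

// Guess the dominant language of the string (e.g. "ja") and build a locale
// from it, falling back to Japanese if no guess is returned.
let language = CFStringTokenizerCopyBestStringLanguage(sample, range) ?? ("ja" as CFString)
let locale = CFLocaleCreate(kCFAllocatorDefault, language)

// Tokenize with word-boundary granularity under that locale, then advance
// through the tokens with CFStringTokenizerAdvanceToNextToken as usual.
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                        sample,
                                        range,
                                        CFOptionFlags(kCFStringTokenizerUnitWordBoundary),
                                        locale)
```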

Unfortunately, perfect word segmentation of Chinese and Japanese text remains an open and complex problem, so any segmentation library you use is going to have some failings. For Japanese, CFStringTokenizer uses the MeCab library internally and ICU's Boundary Analysis (only when using kCFStringTokenizerUnitWordBoundary, which is why you're getting a funny break with "新しい" without it).

– 一二三
  • Thank you for your response, 123. As I mentioned in my original question, I did try setting the locale this way and using kCFStringTokenizerUnitWordBoundary, but the results are not yet acceptable. I am checking my results against what is returned by our server, which is actually using MeCab under Python. I was hoping there might be some further options I might tweak. – arsenius Nov 28 '11 at 16:16
  • Do you have an example of what's "not yet acceptable"? `CFStringTokenizer` doesn't expose any of the underlying MeCab/ICU API, so your best bet is probably to compile and bundle them with your app. – 一二三 Nov 29 '11 at 11:44
  • Here's an example: "歩き 続けて いく うち に" gets returned from the server as "歩き" "続けて" "いく" "うち" "に", but with CFStringTokenizer, "続けて" changes to "続け" and "て". Yesterday I started looking at just using MeCab itself. I may not have any other option. – arsenius Nov 29 '11 at 15:52