
Is there a programmatic way to identify the tones in Chinese text?

For an input string like 苹果 I need to extract the tones as 2 and 3, as they would be indicated in the pinyin transliteration píng guǒ or ping2 guo3.

I guess a possible workaround would be converting the Chinese hanzi text to pinyin (e.g. with pinyin4j) and then extracting the tones from the pinyin, but I assume there must be a better and more direct way to do it.
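For illustration, here is a minimal sketch of that workaround using pinyin4j's character-level lookup. The pinyin4j classes and methods are real; the surrounding code is only a sketch and does not disambiguate polyphonic characters, it simply lists every candidate reading:

```java
import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class ToneExtractor {
    public static void main(String[] args) throws BadHanyuPinyinOutputFormatCombination {
        HanyuPinyinOutputFormat format = new HanyuPinyinOutputFormat();
        format.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER); // "ping2" instead of "píng"

        for (char ch : "苹果".toCharArray()) {
            // all candidate readings of this character, with the tone number appended
            String[] readings = PinyinHelper.toHanyuPinyinStringArray(ch, format);
            if (readings == null) {
                continue; // not a hanzi character
            }
            for (String reading : readings) {
                // the tone is the trailing digit of the reading
                char tone = reading.charAt(reading.length() - 1);
                System.out.println(ch + " -> " + reading + " (tone " + tone + ")");
            }
        }
    }
}
```

For 苹果 this yields ping2 and guo3, but for a polyphonic character it prints all dictionary readings, which is exactly the ambiguity discussed below.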

Context

The question is about whether there is an algorithmic way to identify the tones, or whether the only way is a map lookup against an authoritative source, e.g. the publicly available CEDICT database.

ccpizza
  • Perhaps this might be what you're looking for: https://github.com/sxei/pinyinjs. It does include a getTone() function. – Harrison Sep 19 '20 at 07:39
  • I'm a native speaker, and I doubt that it's possible. A Chinese character can have multiple tones depending on the context. The only reliable way to do this is to call some API with the full context. – Kai Hao Sep 19 '20 at 08:53
  • @KaiHao: so, if I understand you correctly, this means that there is no algorithm that can be used to 'convert' hanzi to tone info, and therefore one _must_ use a mechanism similar to what pinyin converters do: segment the hanzi into known character groups, then into single characters, and then look them up in a map, right? – ccpizza Sep 19 '20 at 09:01
  • @ccpizza That's correct. Since you can't be sure what tone a character has just by looking at it in isolation, there's no such "algorithm" to map characters to their tones. For instance, "一" can be tone 1, 2, 4, or neutral depending on the context. – Kai Hao Sep 19 '20 at 13:14

1 Answer


I'm a native speaker, and I doubt that it's possible. A Chinese character can have multiple tones depending on the context. The only reliable way to do this is to call some API with the full context.

Since you can't be sure what tone a character has just by looking at it in isolation, there is no algorithm that maps characters to their tones.

For instance, "一" can be tone 1, 2, 4, or neutral depending on the context.

Kai Hao
  • Based on your response I conclude that the approach is to build a tone map from existing pinyin data, matching the longest segments first, for example with https://github.com/belerweb/pinyin4j/blob/master/src/main/resources/pinyindb/multi_pinyin.txt, and then process the remaining individual characters with a per-character map, e.g. https://github.com/belerweb/pinyin4j/blob/master/src/main/resources/pinyindb/unicode_to_hanyu_pinyin.txt. Or use the CEDICT-based mapping that can be generated with the scripts from https://github.com/mozillazg/phrase-pinyin-data. (A rough sketch of this lookup follows after these comments.) – ccpizza Sep 19 '20 at 13:37
  • @ccpizza That would be a great start! The best solution, however, is to use an existing trained API. Pre-defined maps are not going to cover all cases. For instance, "分校" can be either "fen1 xiao4" or "fen1 jiao4" depending on the context. This is a rather rare case though, and even well-trained APIs can get it wrong. – Kai Hao Sep 20 '20 at 04:16
  • You are right; in my tests I have seen that all static pinyin conversion libraries (that I know of) have an error rate of ~10%, as described in this question https://chinese.stackexchange.com/questions/40284/examples-of-correct-and-incorrect-pinyin and in the links in the answers. It can be reduced to some degree by extending the phrases file (i.e. the `multi_pinyin.txt` in pinyin4j), but that still leaves at least a 5%+ error rate (mostly with tones). For the time being a static map will be good enough for my use case. – ccpizza Sep 20 '20 at 06:31
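For reference, a rough sketch of the static-map lookup described in the comments above: greedy longest-match against a phrase map (built from e.g. `multi_pinyin.txt` or the phrase-pinyin-data files), falling back to a per-character map. The two maps are hypothetical placeholders; parsing the chosen data source into them is the real work, and the per-character fallback is exactly where the remaining tone errors come from:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToneMapLookup {

    // phrase/character -> tone-numbered pinyin, e.g. "苹果" -> "ping2 guo3";
    // both maps are assumed to be pre-loaded from whichever data source you pick
    private final Map<String, String> phraseMap;
    private final Map<String, String> charMap;
    private final int maxPhraseLength;

    public ToneMapLookup(Map<String, String> phraseMap, Map<String, String> charMap) {
        this.phraseMap = phraseMap;
        this.charMap = charMap;
        this.maxPhraseLength = phraseMap.keySet().stream()
                .mapToInt(String::length).max().orElse(1);
    }

    public List<String> toTonedPinyin(String text) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int consumed = 0;
            // try the longest phrase starting at position i first
            for (int len = Math.min(maxPhraseLength, text.length() - i); len > 1; len--) {
                String candidate = text.substring(i, i + len);
                if (phraseMap.containsKey(candidate)) {
                    result.add(phraseMap.get(candidate));
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                // fall back to the single-character map; this is where
                // polyphonic characters can end up with the wrong tone
                String ch = text.substring(i, i + 1);
                result.add(charMap.getOrDefault(ch, ch));
                consumed = 1;
            }
            i += consumed;
        }
        return result;
    }
}
```

With a `phraseMap` entry for 苹果, `toTonedPinyin("苹果")` returns `["ping2 guo3"]`, and the tone digits can then be stripped out of each syllable.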