How to transliterate Chinese characters to Zhuyin (in Java)

Question

How can I convert Traditional or Simplified Chinese characters to Zhuyin phonetic notation?

Example

# simplified
没关系 --> ㄇㄟˊㄍㄨㄢㄒㄧ

# traditional
沒關係 --> ㄇㄟˊㄍㄨㄢㄒㄧ

I have written a detailed answer on this and how you can convert the transliteration from one language to another via hashmap. Have a look, it might help you:). https://stackoverflow.com/questions/19511512/transliteration-from-hindi-to-english-on-android-without-using-google-api/51557924#51557924 — Jay Dangar, Dec 06 '19 at 06:27
@Jay: so there is no actual 'algorithm' for this, right? Just need to find an established and verified translation table and that's it, no? — ccpizza, Dec 06 '19 at 06:30
It's an algorithm: just use a hashmap and you are good to go. A lookup takes O(1) time on average, so yes, it's just a matter of writing a proper hash table and using it. You can also use a faster technique if you like, and tell me if you find one. :) — Jay Dangar, Dec 06 '19 at 06:31
@NathanHughes: I suppose we should mark them as synonyms. But will need to get enough votes for it: https://stackoverflow.com/tags/synonyms. Somebody with enough reputation for the tag can submit the synonym proposal: https://stackoverflow.com/tags/bopomofo/synonyms — ccpizza, May 21 '20 at 17:53
@ccpizza: as this is the only question tagged `bopomofo` and you're [the only person](https://stackoverflow.com/tags/bopomofo/topusers) who answered a question on this, it will be hard to find someone with more rep on this tag to suggest a synonym ;-) — jps, Jul 20 '20 at 08:29

ccpizza · Answer 1 · 2020-07-31T20:19:46.100

The Python way

The dragonmapper module does hanzi to zhuyin conversion (internally it converts first to pinyin and then to zhuyin):

# install dependencies: pip install dragonmapper

from dragonmapper import hanzi

hanzi.to_zhuyin('太阳')
# returns 'ㄊㄞˋ ㄧㄤ˙'

The Java way

The general approach is to:

1. Convert Chinese text (Simplified or Traditional) to pinyin using pinyin4j (Java), pypinyin (Python), etc.
2. Tokenize the numbered pinyin using a regex created according to this logic (generated final regex).
3. Substitute pinyin tokens with zhuyin using documented mappings such as http://www.pinyin.info/romanization/bopomofo/basic.html or https://terpconnect.umd.edu/~nsw/chinese/pinyin.htm.

Possible scenario for step #1:

Java code

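import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.*;

HanyuPinyinOutputFormat outputFormat = new HanyuPinyinOutputFormat();
outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER);
outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_AND_COLON);
outputFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);

// toHanyuPinyinStringArray() takes a single char and returns all of its readings (null if none);
// it throws BadHanyuPinyinOutputFormatCombination, so call it char by char inside a try block.
StringBuilder pinyin = new StringBuilder();
for (char ch : chineseText.toCharArray()) {
    String[] readings = PinyinHelper.toHanyuPinyinStringArray(ch, outputFormat);
    pinyin.append(readings != null ? readings[0] : String.valueOf(ch)).append(' ');
}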

Python code

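from pypinyin import pinyin, Style

hanzi_text = '當然可以'
# Style.TONE3 yields numbered pinyin, which the tokenizing step expects;
# the default style returns tone marks instead.
pinyin_text = ' '.join(seg[0] for seg in pinyin(hanzi_text, style=Style.TONE3))
print(pinyin_text)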

Scenario for steps #2 and #3:

Provided that you generated numbered pinyin in step #1, you can now break it into syllable tokens and replace each token using a map such as this one or this one (in JS format).
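The tokenize-and-substitute steps could be sketched in Java roughly as follows. This is only a minimal illustration, not a complete converter: the class name `PinyinToZhuyin`, the method `convert`, and the five-entry syllable table are hypothetical, and a real table would be loaded from one of the published pinyin-to-zhuyin mappings. It also assumes that the neutral tone is written as `5` and that the neutral-tone dot is appended after the syllable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PinyinToZhuyin {
    // Hypothetical, deliberately tiny syllable table for illustration only;
    // a real one would be built from a published pinyin-to-zhuyin mapping.
    private static final Map<String, String> SYLLABLES = Map.of(
        "mei", "ㄇㄟ", "guan", "ㄍㄨㄢ", "xi", "ㄒㄧ",
        "tai", "ㄊㄞ", "yang", "ㄧㄤ");

    // Tone digit -> zhuyin tone mark. Tone 1 is unmarked; 5 is assumed to mean neutral.
    private static final Map<Character, String> TONES = Map.of(
        '1', "", '2', "ˊ", '3', "ˇ", '4', "ˋ", '5', "˙");

    // One numbered-pinyin syllable: letters (plus ':' as in "u:") followed by a tone digit.
    private static final Pattern SYLLABLE = Pattern.compile("([a-z:]+)([1-5])");

    public static List<String> convert(String numberedPinyin) {
        List<String> out = new ArrayList<>();
        Matcher m = SYLLABLE.matcher(numberedPinyin.toLowerCase(Locale.ROOT));
        while (m.find()) {
            String base = SYLLABLES.get(m.group(1));
            // Syllables missing from the table are passed through unchanged.
            out.add(base == null ? m.group() : base + TONES.get(m.group(2).charAt(0)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", convert("mei2guan1xi5")));
    }
}
```

Because every syllable ends in a tone digit, a simple regex is enough to tokenize numbered pinyin; tone-marked or toneless pinyin would need a stricter syllable pattern.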

Alternative approach

Another solution would be mapping Chinese characters directly to zhuyin using any of the available mappings such as this one: https://github.com/osfans/rime-tool/blob/master/data/y/taiwan.dict.yaml. The downside is that this particular source only covers Simplified Chinese; Traditional characters will not be processed.

UPDATE: The mapping from the libchewing project covers both Simplified and Traditional characters (plus frequency data and special cases for multi-character words): https://github.com/chewing/libchewing/blob/master/data/tsi.src (4.9MB). To split text into words before the lookup, you'll probably also want a decent Chinese segmentation library such as jieba (Python), jieba-analysis (Java), etc.
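A direct lookup of this kind might look like the sketch below. The class `HanziToZhuyin`, its `convert` method, and the handful of dictionary entries are hypothetical and purely illustrative; loading a real file such as tsi.src is not shown. In place of a proper segmentation library, the sketch uses a plain greedy longest-match scan, which is a much cruder technique, but it shows why multi-character entries matter (行 alone versus 银行).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HanziToZhuyin {
    // Hypothetical miniature dictionary (word -> zhuyin), for illustration only.
    // Real entries would be loaded from a mapping file.
    private static final Map<String, String> DICT = Map.of(
        "没", "ㄇㄟˊ",
        "沒", "ㄇㄟˊ",
        "关系", "ㄍㄨㄢ ㄒㄧ˙",
        "關係", "ㄍㄨㄢ ㄒㄧ˙",
        "行", "ㄒㄧㄥˊ",
        "银行", "ㄧㄣˊ ㄏㄤˊ");

    // Length of the longest dictionary key, in UTF-16 chars.
    private static final int MAX_LEN =
        DICT.keySet().stream().mapToInt(String::length).max().orElse(1);

    // Greedy longest-match: at each position, try the longest candidate first.
    // Assumes all characters are in the Basic Multilingual Plane.
    public static String convert(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String hit = null;
            int used = 1; // advance by one char when nothing matches
            for (int len = Math.min(MAX_LEN, text.length() - i); len >= 1; len--) {
                hit = DICT.get(text.substring(i, i + len));
                if (hit != null) {
                    used = len;
                    break;
                }
            }
            // Characters not covered by the dictionary are passed through unchanged.
            out.add(hit != null ? hit : text.substring(i, i + 1));
            i += used;
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(convert("没关系"));
        System.out.println(convert("银行"));
    }
}
```

Greedy matching can pick the wrong word boundary in ambiguous text, which is where a frequency-aware segmenter would do better.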