I'm currently working on a multi-language online dictionary for Japanese words and kanji. My current problem is generating furigana for kanji compounds within expressions, sentences, and words. In each case I have the kana reading and the kanji form available separately, but I can't get a reliable algorithm to work that generates the reading for each kanji compound based on the kana reading.
I don't need the exact reading for each individual kanji, which is clearly impossible with the data I have, but it should be possible to determine the readings of all kanji compounds, since I have the full sentence/word/expression in kana.

I have: kanji = 私は学生です
kana = わたしはがくせいです

I want to automatically assign
私 to わたし
and
学生 to がくせい.

I tried iterating over the kanji string, checking where the characters 'change' between kana and kanji, and looking everything up to that position up in the kana string. This approach works for all sentences in which no kanji is followed by the same hiragana syllable that its reading ends with.
Another idea of mine was to remove all hiragana sequences of the kanji string from the kana string, and take the remaining kana sequences as readings for the kanji. This clearly doesn't work in every case.
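For reference, that second idea can be expressed as a regex match: kana characters in the mixed string match themselves literally, and each run of kanji becomes a non-greedy capture group. This is only a sketch of the approach described above (the function name and the simple kanji-range check are my own); ambiguous sentences can still match in more than one way, and this returns just one of the possibilities.

```python
import re

def align_readings(kanji_str, kana_str):
    """Align kanji runs in a mixed kanji/kana string with their
    readings in the pure-kana string, by building a regex where kana
    match literally and each kanji run is a non-greedy capture group."""
    pattern = ""
    kanji_runs = []
    run = ""
    for ch in kanji_str:
        if "\u4e00" <= ch <= "\u9fff":   # CJK unified ideograph: part of a kanji run
            run += ch
        else:
            if run:
                kanji_runs.append(run)
                pattern += "(.+?)"       # non-greedy group for the finished run
                run = ""
            pattern += re.escape(ch)     # kana (or other char) matches itself
    if run:                              # trailing kanji run, if any
        kanji_runs.append(run)
        pattern += "(.+?)"
    m = re.fullmatch(pattern, kana_str)
    if not m:
        return None
    return list(zip(kanji_runs, m.groups()))

print(align_readings("私は学生です", "わたしはがくせいです"))
# [('私', 'わたし'), ('学生', 'がくせい')]
```

The non-greedy groups also handle the okurigana case mentioned above (e.g. a reading ending in the same syllable that follows the kanji), because the shortest match is tried first; but genuinely ambiguous inputs still have no unique answer.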

How can I write an algorithm that works in every case?

jojii

1 Answer


The standard way to do this is to use a part-of-speech and morphological analyzer like MeCab. It splits the sentence into tokens and uses a dictionary to generate a reading for each token.

If you feed your example sentence to the CLI, the output looks like this:

私   名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
学生  名詞,一般,*,*,*,*,学生,ガクセイ,ガクセイ
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

The next-to-last column is the reading (in katakana), and the last one is the pronunciation. For choosing which dictionary to use, check out this article.
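Pulling the readings out of that output is a matter of splitting each line on the tab and then on the commas; the next-to-last comma-separated field is the reading. A minimal sketch (assuming the default IPA-dictionary column layout shown above; other dictionaries use different layouts):

```python
# Sample of MeCab's default output: surface form, a tab, then
# comma-separated features, terminated by an EOS line.
sample = """私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
学生\t名詞,一般,*,*,*,*,学生,ガクセイ,ガクセイ
です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS"""

def readings(mecab_output):
    """Return (surface, reading) pairs from MeCab's default output."""
    result = []
    for line in mecab_output.splitlines():
        if line == "EOS":
            continue
        surface, features = line.split("\t")
        fields = features.split(",")
        result.append((surface, fields[-2]))  # next-to-last field = reading
    return result

print(readings(sample))
# [('私', 'ワタシ'), ('は', 'ハ'), ('学生', 'ガクセイ'), ('です', 'デス')]
```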

MeCab has Python bindings (and likely bindings for many other programming languages).

IMPORTANT NOTE: It will NOT always produce the correct readings. There are two reasons for this:

  1. The tokenization may be incorrect
  2. A word can have different readings depending on the context, whereas MeCab always uses a single reading for each word
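One more practical detail: furigana is conventionally written in hiragana, while MeCab emits its readings in katakana. Converting is a fixed code-point shift, since the two kana blocks are laid out in parallel in Unicode; a small sketch:

```python
def kata_to_hira(s):
    """Convert katakana to hiragana by shifting each code point down
    by 0x60 (the katakana block sits exactly 0x60 above hiragana).
    Characters outside the ァ..ヶ range, such as the prolonged-sound
    mark ー, are left untouched."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in s
    )

print(kata_to_hira("ワタシ"))   # わたし
print(kata_to_hira("ガクセイ"))  # がくせい
```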
Didier
  • On a side note, I could not find any Deep Neural Net-based approach to do this, which would likely perform a lot better. Please let me know if you find something like this! – Didier Dec 04 '22 at 15:24