I'm currently working on a multilingual online dictionary for Japanese words and kanji. My current problem is generating furigana for kanji compounds within expressions, sentences and words. In each case I have both the kanji form and the full kana reading available as separate strings, but I can't come up with a reliable algorithm that derives the reading of each kanji compound from the kana reading.
I don't need the exact reading of each individual kanji, which is clearly impossible with the data I have, but it should be possible to determine the reading of every kanji compound, since I have the full sentence/word/expression in kana.
I have:
kanji = 私は学生です
kana = わたしはがくせいです
I want to automatically assign
私 to わたし
and
学生 to がくせい.
I tried iterating over the kanji string, detecting where the characters switch between kana and kanji, and then searching the kana string for the kana character that follows each kanji compound; everything up to that position is taken as the reading. This approach works for all sentences except those where a kanji is followed by a hiragana syllable identical to the one its reading ends with.
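A minimal Python sketch of this first attempt, to make the failure concrete. The names `is_kanji` and `greedy_furigana` are made up for illustration, and it assumes every non-kanji character of the kanji string appears unchanged in the kana string:

```python
def is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs plus the iteration mark 々 (a simplification).
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"


def greedy_furigana(kanji: str, kana: str) -> list[tuple[str, str]]:
    pairs = []
    i = j = 0  # i indexes the kanji string, j the kana string
    while i < len(kanji):
        if not is_kanji(kanji[i]):
            # Assumption: a kana character is identical in both strings.
            i += 1
            j += 1
            continue
        start = i
        while i < len(kanji) and is_kanji(kanji[i]):
            i += 1
        compound = kanji[start:i]
        if i == len(kanji):
            # Compound ends the string: the rest of the kana is its reading.
            reading, j = kana[j:], len(kana)
        else:
            # Search for the kana character that follows the compound,
            # requiring the reading to be at least one character long.
            end = kana.find(kanji[i], j + 1)
            if end == -1:
                raise ValueError("kana does not match kanji string")
            reading, j = kana[j:end], end
        pairs.append((compound, reading))
    return pairs


print(greedy_furigana("私は学生です", "わたしはがくせいです"))
print(greedy_furigana("母は", "ははは"))  # 母 should get はは, but gets は
```

The first call gives the pairs I want; the second stops at the first は it finds and cuts the reading of 母 short.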
Another idea was to remove every hiragana run that appears in the kanji string from the kana string and take the remaining kana runs as the readings of the kanji compounds. This clearly doesn't work in every case either.
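Roughly what this second attempt looks like in Python. Again only a sketch: `readings_by_removal` is a hypothetical name, the character ranges are a simplification, and `|` is assumed never to occur in the kana string:

```python
import re


def readings_by_removal(kanji: str, kana: str) -> list[tuple[str, str]]:
    # Kana runs (hiragana and katakana) and kanji runs of the kanji string.
    kana_runs = re.findall(r"[\u3040-\u30ff]+", kanji)
    kanji_runs = re.findall(r"[\u4e00-\u9fff々]+", kanji)

    rest = kana
    for run in kana_runs:
        # Blank out the first occurrence of each kana run with a separator.
        rest = rest.replace(run, "|", 1)

    # Whatever is left between separators is taken as a reading.
    readings = [r for r in rest.split("|") if r]
    return list(zip(kanji_runs, readings))


print(readings_by_removal("私は学生です", "わたしはがくせいです"))
print(readings_by_removal("花は", "はなは"))  # 花 should get はな, but gets なは
```

The second call shows the problem: the は that belongs to the reading はな is removed instead of the particle, so the leftover kana is wrong.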
How can I write an algorithm that works in every case?