
I was experimenting with beam search decoding of an acoustic model trained with CTC loss on an automatic speech recognition task. The decoder I was using was based on this paper. However, even though many sources describe the integration of a similar word-level language model as beneficial to word error rate, in my case the integration of the LM worsened the results.

This actually does not surprise me too much, because the language model scores only prefixes that end with a finished word, and scoring means multiplying the prefix probability by the LM probability, which can only decrease the probability of the whole prefix. In this way, the probability of prefixes that end with an in-vocabulary word is systematically lowered by the language model, while prefixes that do not yet end with a complete word are not scored by the LM at all. At each time step, the prefixes ending with complete words seem to be discarded due to their lowered scores, while the incomplete prefixes survive in the beam.
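To make the suspected bias concrete, here is a minimal sketch with made-up numbers: two beam prefixes with identical acoustic scores, where only the one that just completed a word gets multiplied by an LM probability (all values and the weight `ALPHA` are hypothetical, not from my actual setup):

```python
import math

# Toy illustration (hypothetical numbers): two beam prefixes with equal
# acoustic log-probability. The first just completed a word and receives a
# word-level LM penalty; the second is still mid-word and receives none.
ACOUSTIC_LOGP = math.log(0.4)   # same acoustic score for both prefixes
LM_LOGP = math.log(0.1)         # LM log-probability of the finished word
ALPHA = 1.0                     # LM weight (hypothetical)

score_complete = ACOUSTIC_LOGP + ALPHA * LM_LOGP  # prefix ending in a word
score_partial = ACOUSTIC_LOGP                     # prefix ending mid-word

# Since LM_LOGP < 0, the completed-word prefix is always penalized relative
# to the partial one, so with a narrow beam it risks being pruned first.
assert score_complete < score_partial
```

This is exactly the asymmetry described above: the LM factor is always ≤ 1, so it can only push finished-word hypotheses down relative to unfinished ones.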

My question is: why should word-level LM integration work if it decreases the probability of valid prefixes? I would understand that a character-level LM that scores every prefix at every step, or a look-ahead word-level LM, could help. For example, Graves describes the integration of a word-level language model that uses the sum of the probabilities of all possible words starting with the given prefix and applies the LM update at every time step, which seems reasonable even though the computational cost could be much larger.
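The look-ahead idea can be sketched like this: an unfinished character prefix is scored by the total probability mass of all vocabulary words it could still become, so partial and complete prefixes are penalized comparably at every step. This is only an illustration with a toy unigram table (`UNIGRAM` and the example words are hypothetical), not the actual model from the paper:

```python
# Toy word-level unigram LM (hypothetical probabilities).
UNIGRAM = {"cat": 0.5, "cattle": 0.3, "car": 0.2}

def lookahead_score(prefix: str) -> float:
    """Sum of probabilities of all vocabulary words consistent with the
    given (possibly unfinished) character prefix."""
    return sum(p for word, p in UNIGRAM.items() if word.startswith(prefix))

print(lookahead_score("cat"))   # 0.8 (mass of "cat" and "cattle")
print(lookahead_score("catt"))  # 0.3 (only "cattle" remains possible)
print(lookahead_score("x"))     # 0   (dead prefix: can be pruned)
```

With a prefix tree over the vocabulary, these sums can be precomputed per tree node instead of scanning the whole vocabulary at each step, which is where most of the extra cost would go.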

JAV
    Have you tried word beam search? Accuracy of course also heavily depends on the training data of the language model. See: https://github.com/githubharald/CTCWordBeamSearch – Harry Jun 09 '20 at 20:26
  • Hi, yes, I tried word beam search with a prefix tree to restrict the characters, but the results are very similar to beam search without alphabet limitations (if I use the LM). Word beam search by itself does not solve the problem of scoring some prefixes earlier than others. I used a bigram model to make sure the probabilities are not too low. I also tried mixing my dataset with a large corpus to train the LM, but got even worse results. Greedy decoding gives me the best results so far. – JAV Jun 10 '20 at 06:36
  • @Harry What should the dataset for the LM look like in order to be beneficial? Also, how does a good word-level language model solve the discarding of scored hypotheses that I describe in my question? – JAV Jun 10 '20 at 09:31
  • There is a forecast mode which sums over all word probabilities that could still be created from the current (unfinished) word. This should help keep beams with unfinished words, as it scores the beam according to what it could become "in the future". Further, I never had the problems you mentioned, but I guess this is due to the small number of time steps involved (~100). Also, is your beam width large enough to hold enough hypotheses? https://github.com/githubharald/CTCWordBeamSearch/blob/master/cpp/src/Beam.cpp#L58 – Harry Jun 10 '20 at 11:45
  • Thanks for your answers. I also experimented with this mode, but the program runs out of memory after a certain number of decoded sentences, so I have not been able to measure WER on my entire dataset yet. I used beam sizes of 25 and 40. I also reimplemented the algorithm in the logarithmic domain for numerical stability, because some of my sentences are quite long. Maybe the training data is too small to get reasonable results? Could you recommend a sufficient corpus size? – JAV Jun 10 '20 at 14:49
  • OK, 25 to 40 is not very much. How long are your input sequences? I'm just wondering why "Words" mode did not improve your results. Do you have too many OOV words in your test set? How big is the OOV rate? Do you have an example of a CTC input + corpus + settings for which you get results worse than with greedy decoding? If yes, you could share it so that I could have a look. (P.S.: Do you have a GitHub account? Maybe we could switch to a GitHub issue, because SO is really a pain for communicating.) – Harry Jun 10 '20 at 15:30
  • The logits may easily have around 1000 steps, and a sentence can have 15-20 words. Alright, I will make an issue and append the necessary files and info about OOV rates tomorrow morning :) – JAV Jun 10 '20 at 16:38

0 Answers