
I understand that Watson Speech to Text is somewhat calibrated for colloquial conversation and for one or two speakers. I also know that it can deal with FLAC better than WAV and OGG.

I would like to know how I could improve the recognition accuracy, acoustically speaking.

I mean, does increasing the volume help? Maybe applying some compression filter? Noise reduction?

What kind of pre-processing could help with this service?

Leo

1 Answer


The best way to improve the accuracy of the base models (which are very accurate but also very general) is to use the Watson STT customization service: https://www.ibm.com/watson/developercloud/doc/speech-to-text/custom.html. That will let you create a custom model tailored to the specifics of your domain. If your domain is not well matched to those captured by the base model, then you can expect a substantial boost in recognition accuracy.
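
A minimal sketch of that workflow, assuming the current ibm-watson Python SDK; the API key, service URL, model names, and file names below are placeholders for illustration, not values from this thread:

```python
# Sketch of Watson STT language-model customization with the ibm-watson
# Python SDK. The API key, service URL, and file names are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

# Create a custom language model on top of a base model.
model = stt.create_language_model(
    name="conference-calls",
    base_model_name="en-US_BroadbandModel",
    description="Domain terms and acronyms from our meetings",
).get_result()
customization_id = model["customization_id"]

# Feed it a plain-text corpus of in-domain sentences, then train.
with open("domain_corpus.txt", "rb") as corpus:
    stt.add_corpus(customization_id, "meetings-corpus", corpus)
stt.train_language_model(customization_id)

# Once training has finished, pass the model id when recognizing.
with open("meeting.flac", "rb") as audio:
    result = stt.recognize(
        audio=audio,
        content_type="audio/flac",
        language_customization_id=customization_id,
    ).get_result()
print(result)
```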

Regarding your comment "I also know that it can deal with FLAC better than WAV and OGG": that is not really the case. The Watson STT service offers full support for FLAC, WAV, Ogg, and other formats (please see this section of the documentation: https://www.ibm.com/watson/developercloud/doc/speech-to-text/input.html#formats).
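
If you still want to convert to FLAC anyway (for example to shrink uploads, since FLAC is lossless but smaller than WAV), here is a hedged sketch using pydub, which shells out to ffmpeg, so ffmpeg must be installed; the file names and the mono/16 kHz choice are illustrative assumptions, not service requirements stated here:

```python
# Sketch: convert an arbitrary input file to 16 kHz mono FLAC with pydub.
# pydub delegates decoding/encoding to ffmpeg. File names and the
# mono/16 kHz settings are illustrative assumptions.
from pydub import AudioSegment

audio = AudioSegment.from_file("meeting.wma")        # ffmpeg detects the format
audio = audio.set_channels(1).set_frame_rate(16000)  # mono, broadband-style rate
audio.export("meeting.flac", format="flac")
```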

Daniel Bolanos
  • Thanks Daniel. Right now, we're trying exactly the approach you've suggested. Watson seems to deal pretty well with acronyms, which is great, so we're enriching the corpus using the customization tool. However, since we're dealing with conference audio, the audio quality varies from speaker to speaker: while some have clear audio, others have very poor audio quality. I understand that what makes audio clear for humans is not necessarily the same for machine learning algorithms. In this case, I'd like to know whether there is any audio filter that could help (volume boost? compression?) – Leo Jul 31 '17 at 14:16
  • Preprocessing the audio by applying a filter has the potential to introduce a mismatch, and you could further degrade recognition accuracy. You can definitely experiment with it, but probably the best option would be acoustic model (AM) customization; stay tuned regarding that feature. I have a question for you: what audio encoding do you use? I wonder if you are losing something because of that. – Daniel Bolanos Jul 31 '17 at 17:04
  • The original audio, probably captured using Lync (Skype for Business), was a Windows Media Video file, so I guess the internal audio format was WMA with some Microsoft-owned codec. I've tried both broadband and narrowband settings for S2T, and broadband seems to work better in this case. It's a pity that we can't customize S2T using audio samples; we have to rely on S2T's default acoustic settings (of course, this is easier than asking the user to tune their own speech recognition method). – Leo Jul 31 '17 at 17:38
  • As you've said, preprocessing the audio without knowing exactly what S2T expects on its side can make things better or worse, so what I am doing right now is experimenting (compression seems to improve accuracy, noise reduction seems to do the opposite; see the sketch after this thread). The idea of this question was exactly to explore some preprocessing tips that could work for this task in general, specifically what kinds of preprocessing procedures I could try in order to improve accuracy (I understand there's no recipe here). – Leo Jul 31 '17 at 17:40
  • @Leo I am currently struggling with this same question, 4 years after it was posted. Were you eventually able to improve the accuracy of the base model when it transcribed recorded audio files? What worked? Also, I am doing this in Python and am currently on the Lite plan. Any suggestions? – Mar 25 '21 at 00:02
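
For anyone experimenting along the lines Leo describes above, here is a minimal sketch of that compression-plus-normalization preprocessing step using pydub's built-in effects; the threshold, ratio, and headroom values are illustrative guesses to tune empirically, not settings recommended by the service:

```python
# Sketch of the preprocessing experiment discussed in the comments:
# dynamic-range compression plus level normalization with pydub.
# The threshold/ratio/headroom values are illustrative assumptions.
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range, normalize

audio = AudioSegment.from_file("speaker.flac", format="flac")

# Tame loudness differences between speakers, then bring the
# overall level up close to full scale.
audio = compress_dynamic_range(audio, threshold=-20.0, ratio=4.0)
audio = normalize(audio, headroom=1.0)

audio.export("speaker_processed.flac", format="flac")
```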