
In our church we have a few Ukrainian refugees who attend the services. To give them an understanding of the sermon, I made an app that sends real-time translations to Telegram.

I have implemented the Google speech-to-text API following this tutorial: https://github.com/googleapis/java-speech/blob/main/samples/snippets/src/main/java/com/example/speech/InfiniteStreamRecognize.java

This works well, but the recognition is often not accurate enough. Is it possible in Google to add audio files with transcriptions so that it can learn from the speaker's output? We always have the same speaker, so if I can get Google to 'know' the speaker, I think the accuracy can be much higher. Or maybe somebody has another idea for improving the accuracy? I did try the speech adaptation boost (https://cloud.google.com/speech-to-text/docs/boost), but that wasn't really helpful.

1 Answer


Regarding your question, I consulted the documentation and found some techniques that may help with your problem. Please find a description of each below:

  1. You can improve the transcription results by using “Model adaptation”. This technique helps Speech-to-Text recognize specific words or phrases more frequently than other candidates that might otherwise be suggested.

     In the context of this technique there is also a model adaptation boost feature, which can be useful for fine-tuning the biasing of the recognition model. Please check the boost documentation (https://cloud.google.com/speech-to-text/docs/boost) for more information.


  1. You can also make use of the enhanced models to improve the quality of STT. Speech-to-Text offers two versions of some models (for example, the phone call model): a standard model and an enhanced model. The enhanced model can provide better results at a higher price (although you can reduce the price by opting into data logging).

  1. I would also like to draw your attention to improving STT accuracy through the use of classes. This concept is part of the model adaptation technique, where classes represent common concepts that occur in natural language, such as monetary units and calendar dates. A class allows you to improve transcription accuracy for large groups of words that map to a common concept but don't always include identical words or phrases.

     Please note that there are predefined classes available; to use a class in model adaptation, include a class token in the phrases field of a PhraseSet resource. Refer to the list of supported class tokens to see which tokens are available for your language.

     Apart from that, another approach to improving STT accuracy is adding common phrases (single- and multi-word) to the phrases field of a PhraseSet object. You can create a list of frequently used phrases, add it to the STT configuration, and check whether the outcome improves.

     Also, if the audio you are trying to transcribe contains multiple channels, the documentation on transcribing multi-channel audio can be useful as well.
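As a sketch of the enhanced-model option described above, using the same google-cloud-speech Java client as your streaming tutorial (the `nl-NL` language code, encoding, and sample rate are placeholders for your setup; note that enhanced-model availability varies by language):

```java
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;

public class EnhancedModelConfig {

    // Builds a RecognitionConfig that opts into an enhanced model.
    // "phone_call" is one of the models with an enhanced variant;
    // whether it is available depends on the language code.
    static RecognitionConfig build() {
        return RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("nl-NL")  // placeholder: use your sermon language
                .setModel("phone_call")    // model to request
                .setUseEnhanced(true)      // ask for the enhanced variant
                .build();
    }

    public static void main(String[] args) {
        RecognitionConfig config = build();
        System.out.println("useEnhanced=" + config.getUseEnhanced()
                + " model=" + config.getModel());
    }
}
```

This config can be dropped into the streaming setup from the tutorial in place of the plain `RecognitionConfig`.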
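A minimal sketch of the speech adaptation approach (phrases, a predefined class token, and boost) via the `SpeechContext` field of `RecognitionConfig`. The example phrases are hypothetical placeholders; replace them with vocabulary that actually occurs in the sermons, and tune the boost value empirically:

```java
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechContext;

public class AdaptationConfig {

    // Builds a SpeechContext that biases recognition toward frequently
    // used phrases. "$TIME" is a predefined class token; the other two
    // phrases are made-up examples.
    static SpeechContext sermonContext() {
        return SpeechContext.newBuilder()
                .addPhrases("Epistle to the Romans")  // hypothetical phrase
                .addPhrases("hallelujah")             // hypothetical phrase
                .addPhrases("$TIME")                  // predefined class token
                .setBoost(15.0f)                      // bias strength; tune empirically
                .build();
    }

    static RecognitionConfig build() {
        return RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("nl-NL")  // placeholder language code
                .addSpeechContexts(sermonContext())
                .build();
    }

    public static void main(String[] args) {
        System.out.println(build().getSpeechContexts(0).getPhrasesList());
    }
}
```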
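And if your recording setup does produce multi-channel audio, a sketch of the relevant config (assuming stereo input; adjust the channel count to your hardware):

```java
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;

public class MultiChannelConfig {

    // Builds a RecognitionConfig for stereo audio, asking the API to
    // transcribe each channel separately.
    static RecognitionConfig build() {
        return RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("nl-NL")  // placeholder language code
                .setAudioChannelCount(2)   // stereo input assumed
                .setEnableSeparateRecognitionPerChannel(true)
                .build();
    }

    public static void main(String[] args) {
        System.out.println("channels=" + build().getAudioChannelCount());
    }
}
```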

Kabilan Mohanraj
  • Thanks for your extensive answer. I had already found those options, but what I was really looking for was more like the way Azure does it: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-train-model I think I will continue on the Azure stack, as it provides more options to customize the model. – Martijn van der Maas May 30 '22 at 12:48
  • Hello @Martijn. You can let Google know that this is a feature that is important to you. However, there is no guarantee nor an ETA for the implementation. Google's [Issue Tracker](https://issuetracker.google.com/) is a place for developers to report issues and make feature requests for their development services. I'd suggest you make a feature request there. The best component to file this under would be [AI & Machine Learning](https://issuetracker.google.com/issues/new?component=187181&template=1161183), with the `Feature Request` template. – Kabilan Mohanraj Jun 20 '22 at 08:45