I am trying to add continuous speech-to-text recognition to a mobile application during a WebRTC audio-only call.
I'm using React Native on the mobile side, with the react-native-webrtc module and a custom web API for the signaling part. I have control over the web API, so I could add the feature on its side if that is the only solution, but I would rather do it on the client side to avoid consuming bandwidth when there is no need.
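For reference, here is roughly how I get hold of the remote audio on the React Native side (simplified sketch; `sendToSignalingServer()` is a placeholder for my custom signaling API, not a react-native-webrtc function):

```js
import { RTCPeerConnection, mediaDevices } from 'react-native-webrtc';

const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] });

let remoteStream = null;
pc.onaddstream = (event) => {
  // Remote call audio; this is the stream I want to transcribe.
  // Newer react-native-webrtc versions deliver it through pc.ontrack / event.streams[0].
  remoteStream = event.stream;
};

async function startCall() {
  const localStream = await mediaDevices.getUserMedia({ audio: true, video: false });
  pc.addStream(localStream); // or pc.addTrack(track, localStream) on newer versions
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToSignalingServer(offer); // placeholder: my custom signaling web API
}
```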
First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition
I merged the audio-only WebRTC demo with the audio visualizer demonstration in one page, but I could not find how to connect a MediaElementAudioSourceNode (created via AudioContext.createMediaElementSource(remoteStream) at line 44 of streamvisualizer.js) to a Web Speech API SpeechRecognition instance. In the Mozilla documentation, the audio input seems to be set up by the class itself, presumably by calling the getUserMedia() API internally, and I found no way to supply my own stream.
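To make the problem concrete, this is roughly what I tried in the browser (simplified; `remoteStream` is the MediaStream coming from the RTCPeerConnection). As far as I can tell, SpeechRecognition exposes no input I can connect my audio node to, it just listens to the default microphone:

```js
// Visualizer side: tapping the remote WebRTC stream with the Web Audio API works.
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = audioCtx.createMediaStreamSource(remoteStream); // remote call audio
const analyser = audioCtx.createAnalyser();
source.connect(analyser); // fine for visualization

// Recognition side: no way to plug `source` or `remoteStream` in.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.onresult = (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log('heard:', transcript);
};
// start() implicitly captures the local microphone; I found no documented way
// to make it listen to the remote stream instead.
recognition.start();
```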
Second, during my research I found two open-source speech-to-text engines: CMUSphinx and Mozilla's DeepSpeech. The first one has a JavaScript binding and looks promising with its audioRecorder, which I could feed with my own MediaElementAudioSourceNode from the first attempt. However, how do I embed this in my React Native application?
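In the browser, feeding such an engine would look roughly like this: tap the remote stream, downsample to 16 kHz mono 16-bit PCM (the format the CMUSphinx models expect, as far as I understand), and hand the buffers over. `sendToRecognizer()` is a placeholder for the actual pocketsphinx.js recognizer call, which I have not wired up yet:

```js
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = audioCtx.createMediaStreamSource(remoteStream);

// ScriptProcessorNode is deprecated but is what the audioRecorder-era code uses;
// an AudioWorklet would be the modern equivalent.
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioCtx.destination); // keeps the node processing in some browsers

const TARGET_RATE = 16000; // 16 kHz, 16-bit mono

processor.onaudioprocess = (event) => {
  const input = event.inputBuffer.getChannelData(0); // Float32 samples at audioCtx.sampleRate
  const ratio = audioCtx.sampleRate / TARGET_RATE;
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // naive downsampling + float [-1, 1] -> signed 16-bit conversion
    const s = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  sendToRecognizer(out); // placeholder: the pocketsphinx.js recognizer runs in a web worker
};
```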
There are also native WebRTC modules for Android and iOS, which I might be able to connect to the platform-specific CMUSphinx bindings (iOS, Android), but I don't know much about interoperability between native classes. Can you help me with that?
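On the React Native side I imagine the bridge would be an ordinary native module; `SpeechModule` and its methods below are purely hypothetical names for whatever the CMUSphinx iOS/Android bindings would end up exposing:

```js
import { NativeModules, NativeEventEmitter } from 'react-native';

// Hypothetical native module wrapping a platform-specific CMUSphinx binding.
const { SpeechModule } = NativeModules;
const speechEvents = new NativeEventEmitter(SpeechModule);

// The native side would receive PCM frames from the native WebRTC audio track
// and emit partial/final transcripts back to JavaScript.
const subscription = speechEvents.addListener('onTranscript', (event) => {
  console.log('transcript:', event.text, 'final:', event.isFinal);
});

SpeechModule.startRecognition({ language: 'en-US' }); // hypothetical API

// Later: SpeechModule.stopRecognition(); subscription.remove();
```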
I haven't created any "grammar" or defined any "hot words" yet, because I am not sure which technology will be involved, but I can do that later once I manage to connect a speech recognition engine to my audio stream.