I am trying to add continuous speech-to-text recognition to a mobile application during a WebRTC audio-only call.
I'm using React Native on the mobile side, with the react-native-webrtc module and a custom web API for the signaling part. I have control over the web API, so I could add the feature on its side if that is the only solution, but I would rather do it on the client side to avoid consuming bandwidth when there is no need.
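For reference, here is roughly how I get hold of the remote audio on the React Native side (simplified sketch; `sendToSignalingServer()` is a placeholder for my custom signaling API, not a react-native-webrtc function):

```js
import { RTCPeerConnection, mediaDevices } from 'react-native-webrtc';

const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] });

let remoteStream = null;
pc.onaddstream = (event) => {
  // Remote call audio; this is the stream I want to transcribe.
  // Newer react-native-webrtc versions deliver it through pc.ontrack / event.streams[0].
  remoteStream = event.stream;
};

async function startCall() {
  const localStream = await mediaDevices.getUserMedia({ audio: true, video: false });
  pc.addStream(localStream); // or pc.addTrack(track, localStream) on newer versions
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToSignalingServer(offer); // placeholder: my custom signaling web API
}
```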
First, I worked on and tested some ideas in my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition
I merged the audio-only WebRTC demo with the audio visualizer demonstration in one page, but I could not find how to connect a MediaElementAudioSourceNode (created via AudioContext.createMediaElementSource(remoteStream) at line 44 of streamvisualizer.js) to a Web Speech API SpeechRecognition instance. In the Mozilla documentation, the audio input seems to be set up by the class itself, presumably by calling the getUserMedia() API internally, and I found no way to supply my own stream.
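To make the problem concrete, this is roughly what I tried in the browser (simplified; `remoteStream` is the MediaStream coming from the RTCPeerConnection). As far as I can tell, SpeechRecognition exposes no input I can connect my audio node to, it just listens to the default microphone:

```js
// Visualizer side: tapping the remote WebRTC stream with the Web Audio API works.
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = audioCtx.createMediaStreamSource(remoteStream); // remote call audio
const analyser = audioCtx.createAnalyser();
source.connect(analyser); // fine for visualization

// Recognition side: no way to plug `source` or `remoteStream` in.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;
recognition.onresult = (event) => {
  const transcript = event.results[event.results.length - 1][0].transcript;
  console.log('heard:', transcript);
};
// start() implicitly captures the local microphone; I found no documented way
// to make it listen to the remote stream instead.
recognition.start();
```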
Second, during my research I found two open-source speech-to-text engines: CMUSphinx and Mozilla's DeepSpeech. The first one has a JavaScript binding and looks promising with its audioRecorder, which I could feed with my own MediaElementAudioSourceNode from the first attempt. However, how do I embed this in my React Native application?
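In the browser, feeding such an engine would look roughly like this: tap the remote stream, downsample to 16 kHz mono 16-bit PCM (the format the CMUSphinx models expect, as far as I understand), and hand the buffers over. `sendToRecognizer()` is a placeholder for the actual pocketsphinx.js recognizer call, which I have not wired up yet:

```js
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const source = audioCtx.createMediaStreamSource(remoteStream);

// ScriptProcessorNode is deprecated but is what the audioRecorder-era code uses;
// an AudioWorklet would be the modern equivalent.
const processor = audioCtx.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioCtx.destination); // keeps the node processing in some browsers

const TARGET_RATE = 16000; // 16 kHz, 16-bit mono

processor.onaudioprocess = (event) => {
  const input = event.inputBuffer.getChannelData(0); // Float32 samples at audioCtx.sampleRate
  const ratio = audioCtx.sampleRate / TARGET_RATE;
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // naive downsampling + float [-1, 1] -> signed 16-bit conversion
    const s = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  sendToRecognizer(out); // placeholder: the pocketsphinx.js recognizer runs in a web worker
};
```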
There are also native WebRTC modules for Android and iOS, which I might be able to connect to the platform-specific CMUSphinx bindings (iOS, Android), but I don't know much about interoperability between native classes. Can you help me with that?
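On the React Native side I imagine the bridge would be an ordinary native module; `SpeechModule` and its methods below are purely hypothetical names for whatever the CMUSphinx iOS/Android bindings would end up exposing:

```js
import { NativeModules, NativeEventEmitter } from 'react-native';

// Hypothetical native module wrapping a platform-specific CMUSphinx binding.
const { SpeechModule } = NativeModules;
const speechEvents = new NativeEventEmitter(SpeechModule);

// The native side would receive PCM frames from the native WebRTC audio track
// and emit partial/final transcripts back to JavaScript.
const subscription = speechEvents.addListener('onTranscript', (event) => {
  console.log('transcript:', event.text, 'final:', event.isFinal);
});

SpeechModule.startRecognition({ language: 'en-US' }); // hypothetical API

// Later: SpeechModule.stopRecognition(); subscription.remove();
```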
I haven't created any "grammar" or defined any "hot words" yet, because I am not sure which technology will be involved, but I can do that later once I manage to connect a speech recognition engine to my audio stream.