I'm currently using Twilio to make phone calls and I'd like to add a speech recognition element such that if a user says a specific phrase, my backend can take specific actions. If you're familiar with Twilio, something akin to the Gather verb. It needs to be real-time since if there are issues with recognition, the user would be prompted for clarification.
5 Answers
To add speech recognition to the Twilio Gather verb, add "speech" to the Gather input value, for example: input="dtmf speech". After the caller says something and then goes quiet, the Twilio server transcribes the speech into text, sends the text to the action URL, and waits for response instructions. Your program can use the text to respond however you choose. One option is to have your program respond with correction instructions (Say verb) and ask the caller to say something more, which would then be processed again by your action URL.
Twilio Gather documentation including the implementation of speech recognition: https://www.twilio.com/docs/api/twiml/gather
Example TwiML with a Gather verb using the speech recognition identifier.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Gather input="dtmf speech" language="en-us"
numDigits="1"
timeout="6"
action="http://hostname/processUserResponse.py">
<Say voice="alice" language="en-CA">
Okay, speech recognition test. Enter any digit or say something.
</Say>
</Gather>
<Say voice="alice" language="en-CA">
Waited too long to say something. Response canceled.
</Say>
</Response>
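A minimal sketch of what the action URL handler might do with the posted result, assuming the request parameters have already been parsed into a dict. `SpeechResult`, `Confidence`, and `Digits` are the parameters Twilio posts to the action URL; `TARGET_PHRASE`, `MIN_CONFIDENCE`, and the function names are hypothetical and only there for illustration. If the phrase is not recognized with enough confidence, the handler returns another Gather so the caller is asked again.

```python
from xml.sax.saxutils import escape

# Hypothetical values for illustration; tune them for your own application.
TARGET_PHRASE = "cancel my order"
MIN_CONFIDENCE = 0.5
ACTION_URL = "http://hostname/processUserResponse.py"  # same URL as the TwiML above


def say_twiml(message):
    """Return TwiML that speaks a message and then ends the call flow."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(message)}</Say></Response>"
    )


def gather_twiml(prompt):
    """Return TwiML that prompts the caller and gathers speech or a digit again."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Gather input="dtmf speech" numDigits="1" timeout="6" action="{ACTION_URL}">'
        f"<Say>{escape(prompt)}</Say>"
        "</Gather>"
        "</Response>"
    )


def build_response(params):
    """Decide what TwiML to send back, given the parameters posted by Twilio."""
    digits = params.get("Digits", "")
    speech = params.get("SpeechResult", "").strip().lower()
    try:
        confidence = float(params.get("Confidence", 0))
    except (TypeError, ValueError):
        confidence = 0.0

    if digits:
        return say_twiml(f"You pressed {digits}.")
    if TARGET_PHRASE in speech and confidence >= MIN_CONFIDENCE:
        return say_twiml("Okay, cancelling your order.")
    # Unrecognized or low-confidence speech: prompt the caller for clarification.
    return gather_twiml("Sorry, I did not catch that. Please say it again.")
```

In a real deployment `build_response` would be called from whatever web framework serves the action URL, with the form fields of the incoming POST passed in as `params`.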

- I was at Twilio's conference when they announced this! – David Jun 14 '17 at 21:41
- As part of the official IVR example in the old C# library I used `Gather.InputEnum.DtmfSpeech`, but that's not valid in .NET Core. Thanks to this answer I assumed I could simply specify (binary-combine) both, but `input` actually takes a list of enums instead. – Tyeth Sep 25 '18 at 22:24
This was briefly covered here: https://stackoverflow.com/a/30224103/6189694
Seems like you would have to set up a conference call, and then join in as a muted user to listen in on the call.
I don't believe there is anything that works in real time to do this. You could, however, use voice recording, pass the recording to another service (IBM's Watson Speech to Text comes to mind), and then handle it from there. It should be able to do this relatively quickly with the right workflow. I have never used Watson, only seen it used, so I am not sure how long it would take to process the recording. I would think one- or two-word commands should be completed quickly.
Sorry I can't provide more guidance. Someone else in the community may have another method.
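A rough sketch of the record-then-transcribe workflow described above, assuming a short recording is captured with the TwiML Record verb and its `RecordingUrl` is then handed to a speech-to-text service. The `transcribe` callable, the handler URL, and the target phrase are all hypothetical placeholders; no particular Watson API is assumed here.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical handler URL and phrase, for illustration only.
HANDLER_URL = "http://hostname/handleRecording.py"
TARGET_PHRASE = "cancel my order"


def record_twiml(action_url=HANDLER_URL, max_seconds=5):
    """Return TwiML that records a short utterance and posts it to action_url."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Say>Please say your command after the beep.</Say>"
        f"<Record action={quoteattr(action_url)} "
        f'maxLength="{int(max_seconds)}" timeout="2" playBeep="true"/>'
        "</Response>"
    )


def handle_recording(params, transcribe):
    """Handle the Record callback.

    params     -- dict of fields posted by Twilio (RecordingUrl is used here)
    transcribe -- caller-supplied function: recording URL -> transcript text
                  (a stand-in for whichever speech-to-text service you choose)
    """
    recording_url = params.get("RecordingUrl")
    if not recording_url:
        # Nothing was recorded, so ask again.
        return record_twiml()

    text = (transcribe(recording_url) or "").lower()
    if TARGET_PHRASE in text:
        message = "Okay, cancelling your order."
    else:
        message = "Sorry, I did not understand that."
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(message)}</Say></Response>"
    )
```

Keeping `maxLength` small is what keeps the round trip short for one- or two-word commands; the transcription latency itself depends entirely on the service plugged in as `transcribe`.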

C# .net Core IVR Gather example using list of enums instead of the combined enum available in the official old C# example as per my comment above (also had to convert the url.actionurl to this monstrosity):
// Accept both keypad digits and speech by passing a list of input enums.
List<Gather.InputEnum> bothDtmfAndSpeech =
    new List<Gather.InputEnum>(2)
    {
        Gather.InputEnum.Dtmf, Gather.InputEnum.Speech
    };
var gather = new Gather(
    action: new Uri(Url.Action("Show", "Menu")),
    numDigits: 1, input: bothDtmfAndSpeech, bargeIn: true);

The IBM Watson Speech to Text (STT) service has this capability; it is called Keyword Spotting (https://www.ibm.com/watson/developercloud/doc/speech-to-text/output.shtml). Watson STT lets you push a live stream of telephony audio, and it not only produces recognition hypotheses but can also detect whether the user said sentences or commands specified beforehand. There is actually a demo that showcases this functionality, please give it a try:

- The issue is accessing the audio in real-time. I'm already aware of the Speech-to-Text capabilities of Watson, Bing, and Google that can be done after the call from a recording, but that's not sufficient. I need to route live audio from a caller to the engine. – David Nov 18 '16 at 23:41
- So your issue is about getting the live stream of audio during the call vs. getting a recording after the call. OK, sorry for the misunderstanding. – Daniel Bolanos Nov 23 '16 at 16:14