
Task: We are building a Dialogflow agent that will interact with callers via our Cisco telephony stack, collecting alphanumeric credentials from the caller.

Here is our proposed architecture:

[Architecture diagram: caller audio from the Cisco stack (CVP/VBB) is passed through the UniMRCP server to Google STT, and the transcribed text is sent to Dialogflow]

Problem: To send text inputs to Dialogflow, we are using Google Cloud's Speech-to-Text (STT) API to convert the caller's audio to text. However, the STT API does not always perform as desired. For example, if a caller says their DOB is 04-04-90, the transcribed audio may come back as "oh for oh 490". The transcription can be greatly improved by passing phrase hints to the API, so we would need to send these hints dynamically based on the scenario. Unfortunately, we're struggling to understand how we can dynamically pass these phrase hints through the UniMRCP server, specifically the Google Speech Recognition plugin.
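For reference, this is roughly what the phrase-hint mechanism looks like when calling the STT API directly with the Python client. It's a minimal sketch: the phrase list, codec, sample rate, and file name are assumptions, and in our architecture the UniMRCP plugin would be making the equivalent call on our behalf.

    # Sketch of passing phrase hints ("speech contexts") to Google Cloud STT.
    # Phrase list, encoding, sample rate, and file name are illustrative.
    from google.cloud import speech

    client = speech.SpeechClient()

    # Hints biasing recognition toward date-like utterances.
    date_hints = speech.SpeechContext(
        phrases=["04 04 1990", "04 04 90", "April 4th 1990"]
    )

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed codec
        sample_rate_hertz=8000,  # typical telephony rate; an assumption here
        language_code="en-US",
        speech_contexts=[date_hints],
    )

    with open("caller_audio.raw", "rb") as f:  # hypothetical capture file
        audio = speech.RecognitionAudio(content=f.read())

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)

The question below is how to get these same hints through the MRCP layer dynamically.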

Question: Section 5.2 of the Google Speech Recognition manual outlines using Dynamic Speech Contexts.

The example provided is:

<grammar mode="voice" root="booking" version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar">
    <meta name="scope" content="hint"/>
    <rule id="booking">
        <one-of>
            <item> 04 04 1990</item>
            <item> 04 04 90</item>
            <item> April 4th 1990</item>
        </one-of>
    </rule>
</grammar>

Does this still transcribe all user input, similar to how the built-in grammar builtin:speech/transcribe would behave?

For example, if I were to say "March 5th 1980", would Google's STT return "March 5th 1980", or only one of the provided items?

To be clear, I would want Google's STT to be able to return more than just the provided items. So if the user says "March 5th 1980", I would want that returned through UniMRCP, the VBB, and CVP, and passed along to Dialogflow. I am being told that even if STT returned "March 5th 1980", CVP or the voice browser would potentially evaluate it as a "no match".

Ryan Stack

1 Answer


Dialogflow accepts more than text inputs: it can also perform intent detection directly on audio, either from a file or from a stream.
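A minimal sketch of detecting an intent from an audio file with the Dialogflow v2 Python client. The project ID, session ID, encoding, sample rate, and file name are placeholders, not values from the question.

    # Sketch: Dialogflow v2 intent detection from audio bytes.
    import uuid
    from google.cloud import dialogflow

    session_client = dialogflow.SessionsClient()
    session = session_client.session_path("my-gcp-project", str(uuid.uuid4()))

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
        sample_rate_hertz=8000,  # assumed telephony sample rate
        language_code="en-US",
    )
    query_input = dialogflow.QueryInput(audio_config=audio_config)

    with open("caller_audio.raw", "rb") as f:  # hypothetical capture file
        input_audio = f.read()

    response = session_client.detect_intent(
        request={
            "session": session,
            "query_input": query_input,
            "input_audio": input_audio,
        }
    )
    print("Transcript:", response.query_result.query_text)
    print("Intent:", response.query_result.intent.display_name)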

Prisoner
  • Good point. Let me adjust my question. I should really have stated that, based on the architecture we are exploring, we will be sending text to Dialogflow that has been converted from speech using Google STT. However, if we provided audio to Dialogflow, would it not just be using the STT API under the hood to then compare the text to the intents? – Ryan Stack Jan 29 '19 at 20:32
  • Yes, but you can provide sample phrases to Dialogflow which include Date Entity types, which it can use to do phrase shaping for the Cloud STT. – Prisoner Jan 29 '19 at 20:37
  • So just to clarify: when sending audio to Dialogflow, Dialogflow will use the Date Entity types we provided in the training phrases as phrase hints to Cloud STT? That would be very nice if true. – Ryan Stack Jan 29 '19 at 20:41
  • That is my understanding, although I haven't tested it myself. As always, you should test to see if it works as well as you'd like. – Prisoner Jan 29 '19 at 20:42
  • Of course. I was not even aware of that capability, however, so thank you very much for bringing it to my attention. – Ryan Stack Jan 29 '19 at 20:43
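Following up on the phrase-shaping point in the comments: if you send audio to Dialogflow yourself, my understanding is that the v2 InputAudioConfig also accepts explicit speech contexts that are forwarded to Cloud STT, so hints can be supplied per request rather than only via entity training phrases. A rough sketch; the phrases and boost value are illustrative assumptions and should be verified against the current API docs.

    # Sketch: supplying explicit speech contexts in a Dialogflow audio request.
    # To my understanding, these are passed through to the underlying Cloud STT.
    from google.cloud import dialogflow

    audio_config = dialogflow.InputAudioConfig(
        audio_encoding=dialogflow.AudioEncoding.AUDIO_ENCODING_LINEAR_16,
        sample_rate_hertz=8000,  # assumed telephony sample rate
        language_code="en-US",
        speech_contexts=[
            dialogflow.SpeechContext(
                phrases=["04 04 1990", "April 4th 1990"],  # illustrative hints
                boost=10.0,  # assumed biasing strength
            )
        ],
    )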