
I would like to capture anything a user says to Alexa in text form, exactly the way 'Alexa, Simon says...' works. Can someone hint at how such an intent can be implemented?

I looked at this, this, and this, but the suggested answers don't work for me and none of them has a concrete accepted answer yet.

The LITERAL slot type works as long as the sample utterance is specified (i.e. hard-coded literally). As the answers in the threads above suggest, I tried to 'train' it by providing 400+ combinations of possible utterances, hoping it would somehow figure out the rest of the combinations. But no dice.

My input could be as random as 'TBD-2019-UK', '17_TBD_UK_Leicester', '17_TBD_UK_Leicester 1', '18_TBD_UK_Leicester 2', 'Chicago IL United States', etc. It is a fairly random combination of year, city, state, country, and other key text in no particular order (let's ignore the special characters for now). Even if 'Chicago IL United States' is specified in the sample utterances, LITERAL is not able to capture something like 'Pittsburgh PA United States' automatically unless that is also hard-coded. There is no way I can come up with ALL possible permutations and combinations of year, city, state, country, and other key data points (because it sounds impractical/ridiculous).

Plus, more values could be added by the user, so it needs to be smart and dynamic.

The problem is, if no matching intent is found for the utterance, then instead of returning the user's speech as text, my skill simply fails to do anything. Alexa just goes off without responding. Any ideas?
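
To show what I mean, here is a stripped-down sketch of the kind of handler I have in mind (raw Lambda handler, hypothetical intent names, not my actual code): when an unexpected intent does reach the skill it can at least reprompt, but when Alexa never maps the utterance to any intent, nothing reaches the handler at all.

```python
# Minimal sketch (raw Lambda handler, hypothetical intent name) of answering an
# unexpected intent with a reprompt instead of going silent. If Alexa never maps
# the utterance to any intent in the model, the skill may not be invoked at all,
# so this only softens the failure.

def say(text, end_session=False):
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return say("Welcome. Tell me what to look up.")
    if request["type"] == "IntentRequest":
        if request["intent"]["name"] == "LookupIntent":  # hypothetical intent
            return say("Looking that up.", end_session=True)
        # Anything the skill doesn't expect gets a reprompt, not silence.
        return say("Sorry, I didn't catch that. Could you rephrase?")
    # SessionEndedRequest and anything else: close out quietly.
    return {"version": "1.0", "response": {"shouldEndSession": True}}
```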

Community
  • Possible duplicate of [Amazon Alexa: store user's words](http://stackoverflow.com/questions/37249475/amazon-alexa-store-users-words) – Sam Hanley Jun 07 '16 at 17:00
  • I see you expressed the fact that you don't feel this is a duplicate because the answers "don't work for you", but unfortunately, I can assure you that the answer provided on that question fully describes the closest thing possible to what you want that can be implemented with the current SDK. As I mention in the comments on that post, the "Simon Says" skill is a first-party skill so it may use non-public features - there's no source available for it. – Sam Hanley Jun 07 '16 at 17:01

1 Answer


Amazon's Alexa service is not designed for dictation. This has been the consistent response from the Developer Evangelists. So, quite simply, you cannot do what you desire: capture free form speech with wide variations.

There are various ways you can 'trick' Alexa into creating a 'generic slot', which I assume is what those links describe. But since this is outside Alexa's design parameters, it will never perform well, as you have found.
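
For reference, the handler side of such a 'generic slot' looks roughly like the sketch below (a hypothetical CatchAllIntent with a single broad custom slot named CatchAll, not something I endorse). You simply read whatever text lands in the slot, with no guarantee it resembles what was actually said.

```python
# Rough sketch of the handler side of a 'generic slot', assuming a hypothetical
# CatchAllIntent with a single broad custom slot named CatchAll. Recognition
# quality is not guaranteed; this only shows where the captured text arrives.

def handle_catch_all(event):
    intent = event["request"]["intent"]
    slot = intent.get("slots", {}).get("CatchAll", {})
    heard = slot.get("value")  # raw text Alexa thinks it heard; may be missing
    speech = f"I heard: {heard}" if heard else "I didn't catch anything."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```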

For your use case, I suggest you break your input down into several exchanges. Don't use a one-shot invocation; use a dialog. For example:

U: Alexa, open spiffy skill
A: Welcome to spiffy skill. I'd love to do something spiffy for you, 
   but I need some information. You can give it to me by saying city,
   year, state, or country followed by what you want me to look up.
U: City Cincinnati
A: OK, got city Cincinnati. I need more information to be spiffy.
   What year?
U: Year 2010
A: OK, I've got Cincinnati, 2010. Should I look that up, or do you have
   more info?
U: Look it up.
A: Got it. So for Cincinnati, 2010 ...
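
A rough sketch of how such a dialog could be wired up on the handler side, assuming hypothetical SetCityIntent, SetYearIntent, and LookupIntent intents with one slot each and a raw Lambda handler; the collected values ride between turns in sessionAttributes:

```python
# Sketch of the dialog above on the handler side, assuming hypothetical intents
# SetCityIntent, SetYearIntent and LookupIntent, each with a single slot, and a
# raw Lambda handler. Collected values are carried between turns in
# sessionAttributes.

def ask(text, attributes):
    """Keep the session open and carry the collected values forward."""
    return {
        "version": "1.0",
        "sessionAttributes": attributes,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": False,
        },
    }

def lambda_handler(event, context):
    attrs = event.get("session", {}).get("attributes") or {}
    request = event["request"]

    if request["type"] != "IntentRequest":  # LaunchRequest etc., kept minimal
        return ask("Welcome to spiffy skill. You can say city, year, state, "
                   "or country followed by what you want me to look up.", attrs)

    intent = request["intent"]
    slots = intent.get("slots", {})

    if intent["name"] == "SetCityIntent":
        attrs["city"] = slots["City"].get("value")
        return ask(f"OK, got city {attrs['city']}. What year?", attrs)

    if intent["name"] == "SetYearIntent":
        attrs["year"] = slots["Year"].get("value")
        return ask(f"OK, I've got {attrs.get('city')}, {attrs['year']}. "
                   "Should I look that up, or do you have more info?", attrs)

    if intent["name"] == "LookupIntent":
        # The actual lookup would go here.
        result = f"Got it. So for {attrs.get('city')}, {attrs.get('year')} ..."
        return {
            "version": "1.0",
            "response": {
                "outputSpeech": {"type": "PlainText", "text": result},
                "shouldEndSession": True,
            },
        }

    return ask("You can give me a city, a year, or say look it up.", attrs)
```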
Joseph Jaquinta
  • I am aware of this and am already doing it for several intents. BUT now I am looking to write a new intent where the key value is not a simple type like a number or a year or a state. It is a combination of several data points in no particular order. Also, what if I want to update a record with user comments (dictation, like you said) like "Followed up with client today. Waiting for a response, hopefully next week. We seem to be on target so far. " This is such a common use-case and a huge miss in my opinion. – Lightning Evangelist Jun 08 '16 at 21:09
  • It is a technical limitation. The wider your vocabulary, the lower the quality. The more restricted the vocabulary, the higher the quality. The same will be true of all speech-to-text systems. You need to either work within the limitation or switch to another platform and deal with lower-quality results. – Joseph Jaquinta Jun 09 '16 at 00:21
  • I disagree. The 'Simon says' intent seems to do the job perfectly, so they clearly have the capability; it is not a technical limitation (even today). – Lightning Evangelist Jun 09 '16 at 00:47
  • 'Simon Says' is not a skill written using Alexa. It is a feature of the Echo product. The same as playing sound clips longer than 90 seconds. It may be implemented using pure playback, and not speech to text. Unless you can get Amazon to implement your skill on hardware, at the Echo level, you are limited to using the Speech-to-Text provided by the Alexa service, with the restrictions given above. – Joseph Jaquinta Jun 09 '16 at 21:19
  • Google Home/Assistant does free form text wonderfully. This is not a fanboy driveby. I'm now developing for Alexa but first I developed for Google Home. Amazon needs to achieve parity in this area. My two cents. – Robert Oschler Apr 25 '18 at 02:27