I'm trying to implement an Android application that holds a conversation with the user via text-to-speech and Android's speech recognition activity.
The following code starts the activity, as documented in the tutorial:
    Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
    intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
            RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
    intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speech recognition demo");
    startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE);
The problem is that the activity takes between 0.5 and 1 second to start recording the user's voice. That doesn't sound like much, but it often means the user has already started talking before the speech recognition activity begins recording, so the application misses the start of what the user says.
Is there a good way to get around this delay so that speech recognition starts as soon as text-to-speech finishes speaking?
Possibilities I've considered:
- Preload the activity in Android and pause it on start. I don't think there's any way to do this unless I have the ability to change the code within the activity, which I don't as it's not part of the Android source.
- Start the activity shortly before text-to-speech finishes. This isn't ideal because it relies on undefined behavior: how long the speech recognition activity takes to load, which can vary from system to system. Additionally, it requires knowing how long text-to-speech will take to say a phrase, which is not part of the text-to-speech API.
- Start the speech recognition activity and then pause the thread that it's running on. Definitely Not Recommended.
- Call methods that aren't exposed in the API from the speech recognition activity from my activity. I don't know how to do this and am not sure if it's even possible.
- Implement my own version of the speech recognition activity. This is what I'm doing now, but it's not trivial by any means and I'd rather not have to write my own FLAC encoder in Java and use Google's servers to do speech recognition without permission.
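
To make the second option concrete, here is a minimal sketch of the arithmetic it would need, in plain Java with no Android dependencies. The class and method names are hypothetical, and both inputs are assumptions: the spoken duration would have to be estimated by hand, and the startup latency would have to be measured per device, which is exactly why this approach feels fragile.

```java
// Hypothetical helper, not part of any Android API.
final class RecognizerLeadTime {
    private RecognizerLeadTime() {}

    /**
     * Returns how long to wait after text-to-speech starts before launching
     * the speech recognition activity, so that recording begins roughly when
     * the spoken phrase ends.
     *
     * @param estimatedSpeechMillis  guessed duration of the spoken phrase (assumption)
     * @param estimatedStartupMillis guessed recognizer startup latency (assumption)
     */
    static long launchDelayMillis(long estimatedSpeechMillis, long estimatedStartupMillis) {
        if (estimatedSpeechMillis < 0 || estimatedStartupMillis < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        // If the phrase is shorter than the startup latency, launch immediately;
        // the recognizer will still start late, just by less.
        return Math.max(0L, estimatedSpeechMillis - estimatedStartupMillis);
    }

    public static void main(String[] args) {
        System.out.println(launchDelayMillis(2000, 750)); // prints 1250
        System.out.println(launchDelayMillis(400, 750));  // prints 0
    }
}
```

If either estimate is off, the recognizer either starts late and clips the user, or starts early and records the tail of the synthesized speech.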
If you have any other ideas for how this could be done properly, or a way around any of the problems above, that would be awesome.