Even though the thread is old, I have some results worth sharing for anyone who comes across it.
The "trick" I am using is to actually make a letter correspond to a "word", or something close:
import Speech

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
recognitionRequest.shouldReportPartialResults = true
// Associate each letter with a word or an onomatopoeia
let letters = ["Hey": "A",
               "Bee": "B",
               "See": "C",
               "Dee": "D",
               "He": "E",
               "Eff": "F",
               "Gee": "G",
               "Atch": "H"]

// This tells the speech recognizer to focus on those words
recognitionRequest.contextualStrings = Array(letters.keys)
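
For completeness, the request also needs to be fed audio buffers from the microphone. A minimal sketch of that plumbing, adapted from Apple's sample project (the audioEngine property and the surrounding throwing function are assumptions taken from that sample, not part of my snippet):

import AVFoundation

// Feed microphone audio into the request
// (assumes an audioEngine: AVAudioEngine property, as in Apple's sample)
let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
    recognitionRequest.append(buffer)
}
audioEngine.prepare()
try audioEngine.start() // inside a throwing function, as in Apple's sample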
Then, when receiving results in the recognitionTask callback, we look up the dictionary to find which letter the recognized word maps to.
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    var isFinal = false
    if let result = result {
        isFinal = result.isFinal
        let bestTranscription = result.bestTranscription
        // Extract the confidence the recognizer has in this word
        let confidence = bestTranscription.segments.isEmpty ? -1 : bestTranscription.segments[0].confidence
        print("Best \(bestTranscription.formattedString) - Confidence: \(confidence)")
        // Only keep results with some confidence
        if confidence > 0 {
            // If the transcription matches one of our keys, we can retrieve the letter
            let spoken = bestTranscription.formattedString
            if let match = letters.first(where: { $0.key.lowercased() == spoken.lowercased() }) {
                print("Letter: \(match.value)")
                // And stop recording afterwards
                self.stopRecording()
            }
        }
    }
    if error != nil || isFinal {
        // The rest of the boilerplate from Apple's doc sample project...
    }
}
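
stopRecording() isn't part of the snippet above; here is a minimal sketch of what it can look like, assuming audioEngine, recognitionRequest and recognitionTask are stored properties as in Apple's sample project:

func stopRecording() {
    // Stop capturing audio and tell the recognizer no more audio is coming
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()
    recognitionTask?.cancel()
    recognitionTask = nil
    recognitionRequest = nil
}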
Notes:
- It's important to set shouldReportPartialResults to true, otherwise it waits quite a while before sending the result.
- After some tests, it seems that when you set recognitionRequest.contextualStrings, the confidence tends to skyrocket when it recognizes one of those strings. You could probably raise the confidence threshold to 0.3 or 0.4 (see the sketch after these notes).
- It might take a long time to cover the whole alphabet, since one word sometimes gets recognized as another. It took a lot of trial and error to get good results for the first 8 letters (e.g. I tried "age" for "H", but it kept being recognized as "Hey", i.e. "A").
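
For that second note, a tiny sketch of what raising the threshold could look like (the 0.3 value is just a starting point to tune, not something from Apple's docs):

// Stricter filter, since contextualStrings boosts confidence on matches
let confidenceThreshold: Float = 0.3 // assumption: tune between 0.3 and 0.4
if confidence > confidenceThreshold {
    // ... look up the letter as above ...
}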
Some results:
Best Gee - Confidence: 0.0
// ... after a while, half a second maybe ...
Best Gee - Confidence: 0.864
Found G
(Apple's sample project to test it out: https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio)