Even though the thread is old, I have some results worth sharing for anyone who comes across it.
The "trick" I am using is to actually make a letter correspond to a "word", or something close:
import Speech

recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
recognitionRequest.shouldReportPartialResults = true
// Associate each letter with a word or an onomatopoeia
let letters = ["Hey": "A",
               "Bee": "B",
               "See": "C",
               "Dee": "D",
               "He": "E",
               "Eff": "F",
               "Gee": "G",
               "Atch": "H"]

// This tells the speech recognizer to focus on those words
recognitionRequest.contextualStrings = Array(letters.keys)
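
For completeness, the request also needs to be fed audio buffers from the microphone. A minimal sketch of that plumbing, adapted from Apple's sample project (the audioEngine property and the surrounding throwing function are assumptions taken from that sample, not part of my snippet):

import AVFoundation

// Feed microphone audio into the request
// (assumes an audioEngine: AVAudioEngine property, as in Apple's sample)
let inputNode = audioEngine.inputNode
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
    recognitionRequest.append(buffer)
}
audioEngine.prepare()
try audioEngine.start() // inside a throwing function, as in Apple's sample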
Then, when receiving results in the recognitionTask callback, we look up the dictionary to find which letter the recognized word maps to.
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    var isFinal = false
    if let result = result {
        isFinal = result.isFinal
        let bestTranscription = result.bestTranscription
        // Extract the confidence the recognizer has in this word
        let confidence = bestTranscription.segments.isEmpty ? -1 : bestTranscription.segments[0].confidence
        print("Best \(bestTranscription.formattedString) - Confidence: \(confidence)")
        // Only keep results with some confidence
        if confidence > 0 {
            // If the transcription matches one of our keys, we can retrieve the letter
            let spoken = bestTranscription.formattedString
            if let match = letters.first(where: { $0.key.lowercased() == spoken.lowercased() }) {
                print("Letter: \(match.value)")
                // And stop recording afterwards
                self.stopRecording()
            }
        }
    }
    if error != nil || isFinal {
        // The rest of the boilerplate from Apple's doc sample project...
    }
}
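
stopRecording() isn't part of the snippet above; here is a minimal sketch of what it can look like, assuming audioEngine, recognitionRequest and recognitionTask are stored properties as in Apple's sample project:

func stopRecording() {
    // Stop capturing audio and tell the recognizer no more audio is coming
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()
    recognitionTask?.cancel()
    recognitionTask = nil
    recognitionRequest = nil
}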
Notes:
- It's important to set shouldReportPartialResults to true, otherwise it waits quite a while before sending the result.
- After some tests, it seems that when you set recognitionRequest.contextualStrings, the confidence tends to skyrocket when it recognizes one of those strings. You could probably raise the confidence threshold to 0.3 or 0.4 (see the sketch after these notes).
- It might take a long time to cover the whole alphabet, since one word sometimes gets recognized as another. It took a lot of trial and error to get good results for the first 8 letters (e.g. I tried "age" for "H", but it kept being recognized as "Hey", i.e. "A").
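
For that second note, a tiny sketch of what raising the threshold could look like (the 0.3 value is just a starting point to tune, not something from Apple's docs):

// Stricter filter, since contextualStrings boosts confidence on matches
let confidenceThreshold: Float = 0.3 // assumption: tune between 0.3 and 0.4
if confidence > confidenceThreshold {
    // ... look up the letter as above ...
}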
Some results:
Best Gee - Confidence: 0.0
// ... after a while, half a second maybe ...
Best Gee - Confidence: 0.864
Found G
(Apple's sample project to test it out: https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio)