SFSpeechRecognizer - detect end of utterance

Question

I am hacking on a small project that uses the speech recognition built into iOS 10. Recognition from the device's microphone works, and my speech is transcribed very accurately.

My problem is that the recognition task callback is called for every available partial transcription. I want it to detect that the person has stopped talking and then call the callback with the isFinal property set to true. That is not happening: the app listens indefinitely.

Is SFSpeechRecognizer capable of detecting the end of a sentence at all?

Here's my code. It is based on an example found on the Internet and is mostly the boilerplate needed to recognize speech from the microphone. I modified it by adding a recognition taskHint. I also set shouldReportPartialResults to false, but that seems to be ignored.

    func startRecording() {

        if recognitionTask != nil {
            recognitionTask?.cancel()
            recognitionTask = nil
        }

        let audioSession = AVAudioSession.sharedInstance()
        do {
            try audioSession.setCategory(AVAudioSessionCategoryRecord)
            try audioSession.setMode(AVAudioSessionModeMeasurement)
            try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
        } catch {
            print("audioSession properties weren't set because of an error.")
        }

        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        recognitionRequest?.shouldReportPartialResults = false
        recognitionRequest?.taskHint = .search

        guard let inputNode = audioEngine.inputNode else {
            fatalError("Audio engine has no input node")
        }

        guard let recognitionRequest = recognitionRequest else {
            fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
        }

        recognitionRequest.shouldReportPartialResults = true

        recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in

            var isFinal = false

            if result != nil {
                print("RECOGNIZED \(result?.bestTranscription.formattedString)")
                self.transcriptLabel.text = result?.bestTranscription.formattedString
                isFinal = (result?.isFinal)!
            }

            if error != nil || isFinal {
                self.state = .Idle

                self.audioEngine.stop()
                inputNode.removeTap(onBus: 0)

                self.recognitionRequest = nil
                self.recognitionTask = nil

                self.micButton.isEnabled = true

                self.say(text: "OK. Let me see.")
            }
        })

        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
            self.recognitionRequest?.append(buffer)
        }

        audioEngine.prepare()

        do {
            try audioEngine.start()
        } catch {
            print("audioEngine couldn't start because of an error.")
        }

        transcriptLabel.text = "Say something, I'm listening!"

        state = .Listening
    }

score 28 · Answer 1 · edited Jan 20 '18 at 13:03

It seems that the isFinal flag doesn't become true when the user stops talking, as one might expect. I guess this behaviour is intended by Apple, because "the user stopped talking" is not a well-defined event.

I believe the easiest way to achieve your goal is the following:

1. Establish an "interval of silence": if the user doesn't talk for longer than this interval (e.g. 2 seconds), treat them as having stopped talking.
2. Create a Timer at the beginning of the audio session (didFinishTalk must be visible to Objective-C, so mark it @objc in Swift 4 and later):

        var timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)

3. Whenever you get a new transcription in recognitionTask, invalidate and restart the timer:

        timer.invalidate()
        timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)

4. If the timer expires, the user hasn't talked for 2 seconds. You can safely stop the audio session and exit.
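
The timer steps above can be wrapped in a small helper. What follows is a minimal sketch, not the answerer's code: `SilenceTimer`, `restart()`, `cancel()` and `onSilence` are hypothetical names, and it uses the block-based `Timer.scheduledTimer(withTimeInterval:repeats:block:)` (available from iOS 10) instead of a selector. It assumes the calls happen on a thread with a running run loop, which is the case when the recognizer delivers results on its default main queue.

```swift
import Foundation

/// Hypothetical helper (not part of the Speech framework): calls `onSilence`
/// once no new transcription has arrived for `interval` seconds.
final class SilenceTimer {
    private var timer: Timer?
    private let interval: TimeInterval
    private let onSilence: () -> Void

    init(interval: TimeInterval, onSilence: @escaping () -> Void) {
        self.interval = interval
        self.onSilence = onSilence
    }

    /// Call when recording starts and again on every partial result.
    /// Each call discards the pending timer and schedules a fresh one.
    func restart() {
        timer?.invalidate()
        timer = Timer.scheduledTimer(withTimeInterval: interval, repeats: false) { [weak self] _ in
            self?.onSilence()
        }
    }

    /// Call when recognition ends for any other reason (error, user tapped stop).
    func cancel() {
        timer?.invalidate()
        timer = nil
    }
}
```

In the result handler you would call `restart()` each time `result` is non-nil and do the teardown (stopping the engine, ending the request) inside `onSilence`.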

edited Jan 20 '18 at 13:03

Ganpat

answered Mar 21 '17 at 11:21

Joe Aspara

2

Why did Apple find this tough? They only need to implement a useful delegate method. – Ganpat Jan 20 '18 at 11:12
3

FWIW, instead of invalidating and re-creating the timer, it's possible to delay it by updating its `fireDate` property. – Tom Harrington Apr 24 '18 at 14:50
5

Looks like a major fail to me. Using these scheduled timers feels like a hack. – user798719 Aug 14 '18 at 06:55
Synchronization issues will follow if you do it this way. If there is a connectivity problem on either the client or the server side, it's better to listen to the audio itself rather than resetting the timer in the utterance callback. – mr5 Dec 05 '18 at 07:02
If Apple had half a brain they would expose some `seconds_silence_to_trigger_did_finish_utterance_callback` variable & internalise this timer machinery. – P i Jul 22 '19 at 04:57
I am struggling to implement this solution here: https://stackoverflow.com/questions/57148596/implementing-user-stopped-speaking-notification-for-sfspeechrecognizer – P i Jul 22 '19 at 14:52
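
One comment above suggests listening to the audio itself instead of relying on the transcription callback. A rough sketch of that idea, under stated assumptions: `rmsLevel(of:)` and `SilenceDetector` are hypothetical names, the 0.01 threshold is an arbitrary starting value that would need tuning per device and environment, and the 2 second window is simply the interval used in the answer. This is plain level-based detection, not anything the Speech framework provides.

```swift
import Foundation

/// Root-mean-square level of one buffer of samples in the range -1.0...1.0.
func rmsLevel(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Hypothetical helper: tracks how long the input has stayed below a threshold.
struct SilenceDetector {
    let threshold: Float               // assumed value, tune for your setup
    let requiredSilence: TimeInterval  // seconds of quiet that count as "stopped talking"
    private(set) var silentFor: TimeInterval = 0

    init(threshold: Float = 0.01, requiredSilence: TimeInterval = 2) {
        self.threshold = threshold
        self.requiredSilence = requiredSilence
    }

    /// Feed one buffer. Returns true once enough consecutive quiet buffers
    /// have accumulated; any loud buffer resets the count.
    mutating func process(samples: [Float], duration: TimeInterval) -> Bool {
        if rmsLevel(of: samples) < threshold {
            silentFor += duration
        } else {
            silentFor = 0
        }
        return silentFor >= requiredSilence
    }
}
```

Inside the `installTap` block the samples could be copied out of `buffer.floatChannelData`, with the duration computed as `Double(buffer.frameLength) / buffer.format.sampleRate`.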

Zebra · Answer 2 · 2018-09-03T09:23:24.630

4

Based on my tests on iOS 10, when shouldReportPartialResults is set to false you have to wait 60 seconds to get the result.

edited Sep 03 '18 at 09:23

answered Sep 03 '18 at 08:33

Zebra

The max length for a recognition task is one minute, which is why it's ending for you. You can stop the task earlier though by calling `recognitionRequest.endAudio()`. – Jay Whitsitt Jun 09 '22 at 00:41

Alan · Answer 3 · 2018-04-24T14:55:27.023

2

I am currently using speech-to-text in an app and it is working fine for me. My recognitionTask block is as follows:

    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
        var isFinal = false

        // Only act on the final result; partial results are ignored here.
        if let result = result, result.isFinal {
            print("Result: \(result.bestTranscription.formattedString)")
            isFinal = true
            completion(result.bestTranscription.formattedString, nil)
        }

        if error != nil || isFinal {
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)

            self.recognitionRequest = nil
            self.recognitionTask = nil

            // Report the failure only when no final result was delivered,
            // so completion is called exactly once per task.
            if !isFinal {
                completion(nil, error)
            }
        }
    })

edited Apr 24 '18 at 14:55

answered Apr 24 '18 at 14:47

Alan

1

Isn't that kind of the opposite of the question? The question is about speech recognition, this seems to cover speech synthesis. – Tom Harrington Apr 24 '18 at 14:51
Whoops indeed. My answer has been edited. Been working on something with speech-to-text and text-to-speech and just got confused. Read the OP's comment about a delegate and instantly thought of that. – Alan Apr 24 '18 at 14:53
How does `completion()` work in 2 places? – David May 12 '21 at 01:55

score 0 · Answer 4 · edited Jul 31 '20 at 07:37

    if let result = result {
        // Restart the silence timer every time a new (partial) result arrives.
        self.timerDidFinishTalk.invalidate()
        self.timerDidFinishTalk = Timer.scheduledTimer(timeInterval: TimeInterval(self.listeningTime), target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)

        let bestString = result.bestTranscription.formattedString

        self.fullsTring = bestString.trimmingCharacters(in: .whitespaces)
        self.st = self.fullsTring
    }

Here self.listeningTime is the number of seconds of silence to wait after the last result before treating the utterance as finished.
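
The snippet above schedules `didFinishTalk` but does not show it. Below is a minimal sketch of what such a handler might do, assuming an object that owns the `AVAudioEngine` and the current `SFSpeechAudioBufferRecognitionRequest`. The class name `SpeechController` is hypothetical; in the answer these would presumably be properties of the existing view controller. Calling `endAudio()` tells the recognizer that no more audio is coming, after which it should deliver its final result. The sketch is written against a current SDK, where `inputNode` is non-optional.

```swift
import AVFoundation
import Speech

/// Hypothetical owner of the recording pipeline (an assumption for this sketch).
final class SpeechController: NSObject {
    let audioEngine = AVAudioEngine()
    var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?

    /// Target of the silence timer: called after `listeningTime` seconds
    /// without a new result.
    @objc func didFinishTalk() {
        // Stop capturing only if the engine is actually running.
        if audioEngine.isRunning {
            audioEngine.stop()
            audioEngine.inputNode.removeTap(onBus: 0)
        }
        // Signal end of input; the result handler should then be called
        // with isFinal == true and can do the remaining cleanup there.
        recognitionRequest?.endAudio()
    }
}
```

Leaving `recognitionRequest` and `recognitionTask` to be cleared in the result handler, as the question's code already does, keeps the teardown in one place.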

score 0 · Answer 5 · answered May 11 '22 at 16:06

I have a different approach that I find far more reliable for determining when the recognitionTask is done guessing: the confidence score.

When shouldReportPartialResults is set to true, the partial results have a confidence score of 0.0. Only the final guess comes back with a score above 0.

    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in

        if let result = result {
            // segments can be empty, so read the first one safely instead of indexing [0]
            let confidence = result.bestTranscription.segments.first?.confidence ?? 0.0
            print(confidence)
            self.transcript = result.bestTranscription.formattedString
        }

    }

The segments array above contains one entry per word in the transcription. The first segment is the safest one to examine, so I tend to use that one.

How you use it is up to you, but if all you want is to know when the guesser is done guessing, you can just write:

    let myIsFinal = confidence > 0.0

You can also look at the score itself (1.0 is totally confident) and sort responses into groups from low to high confidence, if that helps your application.
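
The grouping idea could look roughly like this. It is a sketch under assumptions: `ConfidenceLevel`, `confidenceLevel(for:)` and `averageConfidence(_:)` are hypothetical names, the 0.5 and 0.8 cut-offs are arbitrary, and it takes confidence values to lie between 0.0 and 1.0 as documented for `SFTranscriptionSegment`.

```swift
import Foundation

/// Hypothetical buckets; the cut-off values are assumptions, not framework constants.
enum ConfidenceLevel {
    case none, low, medium, high
}

/// Maps one confidence value (0.0...1.0) to a bucket.
func confidenceLevel(for confidence: Float) -> ConfidenceLevel {
    if confidence <= 0 { return .none }   // treated as "still a partial result"
    if confidence < 0.5 { return .low }
    if confidence < 0.8 { return .medium }
    return .high
}

/// Average confidence across all segments; 0 for an empty transcription.
func averageConfidence(_ confidences: [Float]) -> Float {
    guard !confidences.isEmpty else { return 0 }
    return confidences.reduce(0, +) / Float(confidences.count)
}
```

In the result handler this might be used as `confidenceLevel(for: averageConfidence(result.bestTranscription.segments.map { $0.confidence }))`, which looks at every word rather than only the first one.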
