I am conducting initial research on a potential new product. Part of this product requires Speech-To-Text on both iPhones and iPads to remain on until the user turns it off. Upon using it myself, I noticed that it either automatically shuts off after 30 or so seconds, regardless of whether or not the user has stopped speaking, OR it shuts off after a certain number of questionable words from the speaker. In any case, this product requires it to remain on all of the time until explicitly told to stop. Has anybody worked with this before? And yes, I have tried a good search, but I couldn't find anything of substance, and especially nothing written in the right language. Thanks friends!
4 Answers
import Speech
let recognizer = SFSpeechRecognizer()
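// audioFileURL is a placeholder: the URL of an audio file you want to transcribe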
let request = SFSpeechURLRecognitionRequest(url: audioFileURL)
#if targetEnvironment(simulator)
request.requiresOnDeviceRecognition = /* only appears to work on device; not simulator */ false
#else
request.requiresOnDeviceRecognition = /* only appears to work on device; not simulator */ true
#endif
recognizer?.recognitionTask(with: request, resultHandler: { (result, error) in
print(result?.bestTranscription.formattedString ?? "")
})
The above code snippet, when run on a physical device, will continuously ("persistently") transcribe audio using Apple's Speech framework.
The magic line here is `request.requiresOnDeviceRecognition = ...`.
If `request.requiresOnDeviceRecognition` is true and `SFSpeechRecognizer.supportsOnDeviceRecognition` is true, then the audio will be transcribed continuously until the battery dies, the user cancels transcription, or some other error/terminating condition occurs. At least, this has been true in my trials.
Docs:
https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio
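For live audio rather than a file, the same flag can be set on an `SFSpeechAudioBufferRecognitionRequest`. Here is a minimal sketch, assuming iOS 13+ (where `supportsOnDeviceRecognition` and `requiresOnDeviceRecognition` are available) and an "en-US" recognizer:
import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
let request = SFSpeechAudioBufferRecognitionRequest()

if #available(iOS 13, *), recognizer?.supportsOnDeviceRecognition == true {
    // Forcing on-device recognition is what appears to lift the server-side
    // duration limit; guard on support first, or the request may fail.
    request.requiresOnDeviceRecognition = true
}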

- Please don't just leave it at that! I've figured out that we need to `import Speech` - but what does the `url` signify? - the Xcode docs are silent... – Grimxn Aug 02 '16 at 20:53
- I'm assuming that the `url` parameter in the above code is the location of some sort of audio file, whether it is online (like a YouTube video) or even a hosted file from Dropbox. It seems to be one way to analyze the input speech. – Ethan Aug 03 '16 at 04:33
- @Grimxn That allows you to recognize the speech from a saved audio file, while `SFSpeechAudioBufferRecognitionRequest` allows you to recognize the speech coming from the microphone. – Iulian Onofrei Oct 18 '16 at 09:50
- @IulianOnofrei - add this as a second answer! I'll upvote it! – Grimxn Oct 18 '16 at 18:20
- I don't think it's complete enough for an answer; you can upvote my comment so it will show up even if the comments list is truncated. – Iulian Onofrei Oct 18 '16 at 18:22
- @IulianOnofrei Can I stop this continuous loop of recognizing from the microphone after 2-3 seconds of silence? – Rocky Balboa Feb 01 '17 at 12:55
- This does not solve the problem when using speech coming from the microphone though. The session still stops after 30 seconds or so. – Joshua Vidamo Jun 27 '17 at 09:49
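To make the distinction raised in the comments above concrete, here is a rough sketch of the two request types (the file path and variable names are placeholders):
import Speech

// Transcribing a pre-recorded file (what the snippet above does):
let audioFileURL = URL(fileURLWithPath: "/path/to/recording.m4a") // placeholder path
let fileRequest = SFSpeechURLRecognitionRequest(url: audioFileURL)

// Transcribing live microphone audio: create a buffer request and append
// AVAudioPCMBuffers from an AVAudioEngine input-node tap as they arrive.
let liveRequest = SFSpeechAudioBufferRecognitionRequest()
// inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
//     liveRequest.append(buffer)
// }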
I found a tutorial here that shows your speech as you speak, but note these caveats:
- Apple limits recognition per device. The limit is not known, but you can contact Apple for more information.
- Apple limits recognition per app.
- If you routinely hit limits, make sure to contact Apple; they can probably resolve it.
- Speech recognition uses a lot of power and data.
- Speech recognition only lasts about a minute at a time.
EDIT
This answer was for iOS 10. I expect iOS 12 to be released in October 2018, but Apple still says:
Plan for a one-minute limit on audio duration. Speech recognition can place a relatively high burden on battery life and network usage. In iOS 10, utterance audio duration is limited to about one minute, which is similar to the limit for keyboard-related dictation.
See: https://developer.apple.com/documentation/speech
There are no API changes in the Speech framework for iOS 11 and 12. See all API changes, and especially those for iOS 12, in detail in Paul Hudson's iOS 12 APIs Diffs.
So my answer should still be valid.
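If you do hit the roughly one-minute limit, a common workaround (and essentially what the next answer does) is to tear the session down and start a new one whenever the task terminates. A minimal sketch, assuming `speechRecognizer`, `recognitionRequest`, `recognitionTask`, `audioEngine`, and a `startRecording()` method like the ones elsewhere on this page:
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        print(result.bestTranscription.formattedString)
    }
    // When the task ends (error, final result, or the ~1-minute cutoff),
    // tear everything down and start a fresh session.
    if error != nil || result?.isFinal == true {
        self.audioEngine.stop()
        self.audioEngine.inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
        try? self.startRecording() // hypothetical restart helper
    }
}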

This will help you automatically restart recording every 40 seconds, even if you don't speak at all. If you speak and then there are 2 seconds of silence, recording stops and the `didFinishTalk` function is called.
@objc func startRecording() {
self.fullsTring = ""
audioEngine.reset()
if recognitionTask != nil {
recognitionTask?.cancel()
recognitionTask = nil
}
let audioSession = AVAudioSession.sharedInstance()
do {
try audioSession.setCategory(.record)
try audioSession.setMode(.measurement)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
try audioSession.setPreferredSampleRate(44100.0)
if audioSession.isInputGainSettable {
do {
try audioSession.setInputGain(1.0)
} catch {
print("audio error: \(error)")
return
}
}
else {
print("Cannot set input gain")
}
} catch {
print("audioSession properties weren't set because of an error.")
}
recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
let inputNode = audioEngine.inputNode
guard let recognitionRequest = recognitionRequest else {
fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
}
recognitionRequest.shouldReportPartialResults = true
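// timer4 fires after 40 seconds and calls againStartRec to restart the session.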
self.timer4 = Timer.scheduledTimer(timeInterval: TimeInterval(40), target: self, selector: #selector(againStartRec), userInfo: nil, repeats: false)
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { (result, error ) in
var isFinal = false //8
if result != nil {
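// A new partial result arrived: reset the 2-second silence timer that calls didFinishTalk.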
self.timer.invalidate()
self.timer = Timer.scheduledTimer(timeInterval: TimeInterval(2.0), target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)
let bestString = result?.bestTranscription.formattedString
self.fullsTring = bestString!
self.inputContainerView.inputTextField.text = result?.bestTranscription.formattedString
isFinal = result!.isFinal
}
if isFinal {
self.audioEngine.stop()
inputNode.removeTap(onBus: 0)
self.recognitionRequest = nil
self.recognitionTask = nil
isFinal = false
}
if error != nil{
URLCache.shared.removeAllCachedResponses()
self.audioEngine.stop()
inputNode.removeTap(onBus: 0)
guard let task = self.recognitionTask else {
return
}
task.cancel()
task.finish()
}
})
audioEngine.reset()
inputNode.removeTap(onBus: 0)
let recordingFormat = AVAudioFormat(standardFormatWithSampleRate: 44100, channels: 1)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
self.recognitionRequest?.append(buffer)
}
audioEngine.prepare()
do {
try audioEngine.start()
} catch {
print("audioEngine couldn't start because of an error.")
}
self.hasrecorded = true
}
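// Fired by timer4 every 40 seconds: stops the current session and schedules a fresh startRecording after 2 seconds.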
@objc func againStartRec(){
self.inputContainerView.uploadImageView.setBackgroundImage( #imageLiteral(resourceName: "microphone") , for: .normal)
self.inputContainerView.uploadImageView.alpha = 1.0
self.timer4.invalidate()
self.timer.invalidate()
if ((self.audioEngine.isRunning)){
self.audioEngine.stop()
self.recognitionRequest?.endAudio()
self.recognitionTask?.finish()
}
self.timer2 = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(startRecording), userInfo: nil, repeats: false)
}
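// Called after ~2 seconds of silence once some speech has been transcribed: stops the engine and finishes the task.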
@objc func didFinishTalk(){
if self.fullsTring != ""{
self.timer4.invalidate()
self.timer.invalidate()
self.timer2.invalidate()
if ((self.audioEngine.isRunning)){
self.audioEngine.stop()
guard let task = self.recognitionTask else {
return
}
task.cancel()
task.finish()
}
}
}

///
/// Code lightly adapted from https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio?language=swift
///
/// Modifications from original:
/// - Color of text changes every time a new "chunk" of text is transcribed
/// -- This was a feature I added while playing with my nephews (ages 2 and 6). They loved it; we kept saying "rainbow"
/// - I added a bit of logic to scroll to the end of the text once new chunks were added
/// - I formatted the code using swiftformat
///
import Speech
import UIKit
public class ViewController: UIViewController, SFSpeechRecognizerDelegate {
private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var recognitionTask: SFSpeechRecognitionTask?
private let audioEngine = AVAudioEngine()
@IBOutlet var textView: UITextView!
@IBOutlet var recordButton: UIButton!
let colors: [UIColor] = [.red, .orange, .yellow, .green, .blue, .purple]
var colorIndex = 0
override public func viewDidLoad() {
super.viewDidLoad()
textView.textColor = colors[colorIndex]
// Disable the record buttons until authorization has been granted.
recordButton.isEnabled = false
}
override public func viewDidAppear(_ animated: Bool) {
super.viewDidAppear(animated)
// Configure the SFSpeechRecognizer object already
// stored in a local member variable.
speechRecognizer.delegate = self
// Asynchronously make the authorization request.
SFSpeechRecognizer.requestAuthorization { authStatus in
// Divert to the app's main thread so that the UI
// can be updated.
OperationQueue.main.addOperation {
switch authStatus {
case .authorized:
self.recordButton.isEnabled = true
case .denied:
self.recordButton.isEnabled = false
self.recordButton.setTitle("User denied access to speech recognition", for: .disabled)
case .restricted:
self.recordButton.isEnabled = false
self.recordButton.setTitle("Speech recognition restricted on this device", for: .disabled)
case .notDetermined:
self.recordButton.isEnabled = false
self.recordButton.setTitle("Speech recognition not yet authorized", for: .disabled)
default:
self.recordButton.isEnabled = false
}
}
}
}
private func startRecording() throws {
// Cancel the previous task if it's running.
recognitionTask?.cancel()
recognitionTask = nil
// Configure the audio session for the app.
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
let inputNode = audioEngine.inputNode
// Create and configure the speech recognition request.
recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
/// The below lines are responsible for keeping the recording active longer
/// than just short bursts. I've had the recording going all day in somewhat
/// rudimentary attempts.
////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////
if #available(iOS 13, *) {
let supportsOnDeviceRecognition = speechRecognizer.supportsOnDeviceRecognition
if !supportsOnDeviceRecognition {
fatalError("On device transcription not supported on this device. It is safe to remove this error but I wanted to add it as a warning that you'd actually see.")
}
recognitionRequest!.requiresOnDeviceRecognition = /* only appears to work on device; not simulator */ supportsOnDeviceRecognition
}
guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
recognitionRequest.shouldReportPartialResults = true
// Create a recognition task for the speech recognition session.
// Keep a reference to the task so that it can be canceled.
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
var isFinal = false
if let result = result {
// Update the text view with the results.
self.colorIndex = (self.colorIndex + 1) % self.colors.count
self.textView.text = result.bestTranscription.formattedString
self.textView.textColor = self.colors[self.colorIndex]
self.textView.scrollRangeToVisible(NSMakeRange(result.bestTranscription.formattedString.count - 1, 0))
isFinal = result.isFinal
print("Text \(result.bestTranscription.formattedString)")
}
if error != nil || isFinal {
// Stop recognizing speech if there is a problem.
self.audioEngine.stop()
inputNode.removeTap(onBus: 0)
self.recognitionRequest = nil
self.recognitionTask = nil
self.recordButton.isEnabled = true
self.recordButton.setTitle("Start Recording", for: [])
}
}
// Configure the microphone input.
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, _: AVAudioTime) in
self.recognitionRequest?.append(buffer)
}
audioEngine.prepare()
try audioEngine.start()
// Let the user know to start talking.
textView.text = "(Go ahead, I'm listening)"
}
// MARK: SFSpeechRecognizerDelegate
public func speechRecognizer(_: SFSpeechRecognizer, availabilityDidChange available: Bool) {
if available {
recordButton.isEnabled = true
recordButton.setTitle("Start Recording", for: [])
} else {
recordButton.isEnabled = false
recordButton.setTitle("Recognition Not Available", for: .disabled)
}
}
// MARK: Interface Builder actions
@IBAction func recordButtonTapped() {
if audioEngine.isRunning {
audioEngine.stop()
recognitionRequest?.endAudio()
recordButton.isEnabled = false
recordButton.setTitle("Stopping", for: .disabled)
} else {
do {
try startRecording()
recordButton.setTitle("Stop Recording", for: [])
} catch {
recordButton.setTitle("Recording Not Available", for: [])
}
}
}
}
The above code snippet, when run on a physical device, will continuously ("persistently") transcribe audio using Apple's Speech framework.
The magic line here is `request.requiresOnDeviceRecognition = ...`.
If `request.requiresOnDeviceRecognition` is true and `SFSpeechRecognizer.supportsOnDeviceRecognition` is true, then the audio will be transcribed continuously until the battery dies, the user cancels transcription, or some other error/terminating condition occurs. At least, this has been true in my trials.
Docs:
https://developer.apple.com/documentation/speech/recognizing_speech_in_live_audio
Notes:
I had originally attempted to edit this answer [0], but I wanted to add so much detail that I felt it completely hijacked the original answer. I will be maintaining my own answer, ideally with an approach that translates this one into SwiftUI and also into the Composable Architecture (adopting their example [1]), as a canonical quick-start for voice transcription on Apple platforms.
