I am trying to use SFSpeechRecognizer for speech-to-text, after speaking a welcome message to the user via AVSpeechUtterance. But randomly, the speech recognition does not start (after speaking the welcome message) and it throws the error message below.

[avas] ERROR: AVAudioSession.mm:1049: -[AVAudioSession setActive:withOptions:error:]: Deactivating an audio session that has running I/O. All I/O should be stopped or paused prior to deactivating the audio session.

It works a few times, but I am not clear on why it does not work consistently.

I tried the solutions mentioned in other SO posts, which suggest checking whether any audio players are running. I added that check to the speech-to-text part of the code. It returns false (i.e. no other audio player is running), but the speech-to-text still does not start listening for the user's speech. Can you please guide me on what is going wrong? The check I added looks roughly like the sketch below.
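(A minimal sketch; isOtherAudioPlaying is the AVAudioSession property I used, and my actual code differs slightly.)

if ([[AVAudioSession sharedInstance] isOtherAudioPlaying]) {
    // Another app's audio is playing on the shared session
    NSLog(@"Another audio player is running");
} else {
    // This is the branch I consistently hit
    NSLog(@"No other audio player is running");
}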

I am testing on an iPhone 6 running iOS 10.3.

Below are code snippets used:

TextToSpeech:

- (void) speak:(NSString *) textToSpeak {
    [[AVAudioSession sharedInstance] setActive:NO withOptions:0 error:nil];
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryPlayback
      withOptions:AVAudioSessionCategoryOptionDuckOthers error:nil];

    [synthesizer stopSpeakingAtBoundary:AVSpeechBoundaryImmediate];

    AVSpeechUtterance *utterance = [[AVSpeechUtterance alloc] initWithString:textToSpeak];
    utterance.voice = [AVSpeechSynthesisVoice voiceWithLanguage:locale];
    utterance.rate = (AVSpeechUtteranceMinimumSpeechRate * 1.5 + AVSpeechUtteranceDefaultSpeechRate) / 2.5 * rate * rate;
    utterance.pitchMultiplier = 1.2;
    [synthesizer speakUtterance:utterance];
}

- (void)speechSynthesizer:(AVSpeechSynthesizer*)synthesizer didFinishSpeechUtterance:(AVSpeechUtterance*)utterance {
    //Return success message back to caller

    [[AVAudioSession sharedInstance] setActive:NO withOptions:0 error:nil];
    [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryAmbient
      withOptions: 0 error: nil];
    [[AVAudioSession sharedInstance] setActive:YES withOptions: 0 error:nil];
}

Speech To Text:

- (void) recordUserSpeech:(NSString *) lang {
    NSLocale *locale = [[NSLocale alloc] initWithLocaleIdentifier:lang];
    self.sfSpeechRecognizer = [[SFSpeechRecognizer alloc] initWithLocale:locale];
    [self.sfSpeechRecognizer setDelegate:self];

    NSLog(@"Step1: ");
    // Cancel the previous task if it's running.
    if ( self.recognitionTask ) {
        NSLog(@"Step2: ");
        [self.recognitionTask cancel];
        self.recognitionTask = nil;
    }

    NSLog(@"Step3: ");
    [self initAudioSession];

    self.recognitionRequest = [[SFSpeechAudioBufferRecognitionRequest alloc] init];
    NSLog(@"Step4: ");

    if (!self.audioEngine.inputNode) {
        NSLog(@"Audio engine has no input node");
    }

    if (!self.recognitionRequest) {
        NSLog(@"Unable to created a SFSpeechAudioBufferRecognitionRequest object");
    }

    self.recognitionTask = [self.sfSpeechRecognizer recognitionTaskWithRequest:self.recognitionRequest resultHandler:^(SFSpeechRecognitionResult *result, NSError *error) {

        BOOL isFinal = NO;

        if (error) {
            [self stopAndRelease];
            NSLog(@"In recognitionTaskWithRequest.. Error code ::: %ld, %@", (long)error.code, error.description);
            [self sendErrorWithMessage:error.localizedFailureReason andCode:error.code];
        }

        if (result) {

            [self sendResults:result.bestTranscription.formattedString];
            isFinal = result.isFinal;
        }

        if (isFinal) {
            NSLog(@"result.isFinal: ");
            [self stopAndRelease];
            //return control to caller
        }
    }];

    NSLog(@"Step5: ");

    AVAudioFormat *recordingFormat = [self.audioEngine.inputNode outputFormatForBus:0];

    [self.audioEngine.inputNode installTapOnBus:0 bufferSize:1024 format:recordingFormat block:^(AVAudioPCMBuffer * _Nonnull buffer, AVAudioTime * _Nonnull when) {
        //NSLog(@"Installing Audio engine: ");
        [self.recognitionRequest appendAudioPCMBuffer:buffer];
    }];

    NSLog(@"Step6: ");

    [self.audioEngine prepare];
    NSLog(@"Step7: ");
    NSError *err;
    [self.audioEngine startAndReturnError:&err];
}
- (void) initAudioSession
{
    AVAudioSession *audioSession = [AVAudioSession sharedInstance];
    [audioSession setCategory:AVAudioSessionCategoryRecord error:nil];
    [audioSession setMode:AVAudioSessionModeMeasurement error:nil];
    [audioSession setActive:YES withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation error:nil];
}

-(void) stopAndRelease
{
    NSLog(@"Invoking SFSpeechRecognizer stopAndRelease: ");
    [self.audioEngine stop];
    [self.recognitionRequest endAudio];
    [self.audioEngine.inputNode removeTapOnBus:0];
    self.recognitionRequest = nil;
    [self.recognitionTask cancel];
    self.recognitionTask = nil;
}

Regarding the logs added, I am able to see all logs up to "Step7" printed.

When debugging the code on the device, it consistently breaks at the lines below (I have exception breakpoints set), though continuing resumes execution. The same breaks also occur during the few successful runs.

AVAudioFormat *recordingFormat = [self.audioEngine.inputNode outputFormatForBus:0];

[self.audioEngine prepare];

– csharpnewbie

1 Answer

The reason is that the audio had not completely finished when -speechSynthesizer:didFinishSpeechUtterance: was called, which is why you get this kind of error when trying to call setActive:NO. You can't deactivate the AudioSession or change any of its settings while I/O is running. Workaround: wait several ms (how long, read below) and only then perform the AudioSession deactivation; a sketch follows.
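A minimal sketch of that workaround, deferring the session changes with dispatch_after instead of doing them synchronously in the delegate callback (the 0.1 s value is an assumption; see the ioBufferDuration discussion below for how to size it):

- (void)speechSynthesizer:(AVSpeechSynthesizer *)synthesizer didFinishSpeechUtterance:(AVSpeechUtterance *)utterance {
    // Defer deactivation: the hardware may still be draining the last buffer.
    dispatch_after(dispatch_time(DISPATCH_TIME_NOW, (int64_t)(0.1 * NSEC_PER_SEC)),
                   dispatch_get_main_queue(), ^{
        NSError *error = nil;
        [[AVAudioSession sharedInstance] setActive:NO
                                       withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation
                                             error:&error];
        if (error) {
            NSLog(@"Deactivation failed: %@", error);
        }
        // ...re-configure the category and reactivate here...
    });
}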

A few words about audio playing completion.

That might seem weird at first glance, but I've spent tons of time researching this issue. When you send the last sound chunk to the device output, you only have an approximate time for when it will actually complete. Look at the AudioSession property ioBufferDuration:

The audio I/O buffer duration is the number of seconds for a single audio input/output cycle. For example, with an I/O buffer duration of 0.005 s, on each audio I/O cycle:

  • You receive 0.005 s of audio if obtaining input.
  • You must provide 0.005 s of audio if providing output.

The typical maximum I/O buffer duration is 0.93 s (corresponding to 4096 sample frames at a sample rate of 44.1 kHz). The minimum I/O buffer duration is at least 0.005 s (256 frames) but might be lower depending on the hardware in use.

So, we can interpret this value as the playback time of one chunk. But there is still a small, uncalculated interval between that timeline and the actual completion of audio playback (hardware delay). I would say you need to wait about ioBufferDuration * 1000 + delay ms to be sure audio playback is complete (ioBufferDuration * 1000 because the property is a duration in seconds), where delay is some fairly small value. A sketch of that calculation follows.
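For example (the 50 ms of padding for hardware delay is an assumed value, not something Apple documents):

// Derive the wait from the session's current I/O buffer duration.
NSTimeInterval ioBufferDuration = [[AVAudioSession sharedInstance] IOBufferDuration];
double waitMs = ioBufferDuration * 1000.0 + 50.0; // assumed 50 ms padding
NSLog(@"Waiting %.0f ms before deactivating the session", waitMs);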

Moreover, it seems even Apple's developers are not entirely sure about the audio completion time. Take a quick look at the newer audio class AVAudioPlayerNode and func scheduleBuffer(_ buffer: AVAudioPCMBuffer, completionHandler: AVFoundation.AVAudioNodeCompletionHandler? = nil):

@param completionHandler called after the buffer has been consumed by the player or the player is stopped. may be nil.

@discussion Schedules the buffer to be played following any previously scheduled commands. It is possible for the completionHandler to be called before rendering begins or before the buffer is played completely.
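In other words, even this completion handler is only an approximation of playback completion. A sketch of the call (playerNode and buffer are assumed to be an AVAudioPlayerNode and AVAudioPCMBuffer set up elsewhere):

// Per the docs above, this may fire before the buffer has audibly finished.
[playerNode scheduleBuffer:buffer completionHandler:^{
    NSLog(@"Buffer consumed; actual playback may still be finishing");
}];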

You can read more about audio processing in Understanding the Audio Unit Render Callback Function (AudioUnit is the low-level API that provides the fastest access to I/O data). For reference, that render callback has the C signature sketched below.
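(Names here are illustrative; this is the standard AURenderCallback shape, not code from the question.)

static OSStatus renderCallback(void *inRefCon,
                               AudioUnitRenderActionFlags *ioActionFlags,
                               const AudioTimeStamp *inTimeStamp,
                               UInt32 inBusNumber,
                               UInt32 inNumberFrames,
                               AudioBufferList *ioData) {
    // Fill ioData with inNumberFrames frames of audio for this I/O cycle.
    return noErr;
}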

– Asya
  • Thanks for your response. I have the issue when recognizing user speech (i.e. in the speech-to-text part). The audio speak-out works just fine. So do you mean to add a delay in 'initAudioSession' in the SpeechToText process? Also, as mentioned, I added checks to see if there are audio players running; I added that check in the speech-to-text part of the code. It returns false (i.e. no other audio player is running), but the speech-to-text still does not start listening for the user's speech. Also, I run my code in debugging mode, which naturally adds a few seconds of delay too. – csharpnewbie Apr 25 '17 at 13:08
  • Is there any way to figure out when the speech utterance has actually finished? I was assuming '-speechSynthesizer:didFinishSpeechUtterance:' would be called when the speech utterance has actually finished. Is that not the case? – csharpnewbie Apr 25 '17 at 13:09
  • It's not about "other audio is running"; it's your audio that is still running. Yes, '-speechSynthesizer:didFinishSpeechUtterance:' is called when the audio finishes playing, but not exactly then. I tried to explain above that there is a small delay which can't be heard, but it still exists. It's down to the hardware implementation. The problem is not in the speech-to-text part; it's in the AudioSession deactivation. Try adding `dispatch_after` inside the didFinishSpeechUtterance callback and putting the `AudioSession` stuff in it. – Asya Apr 25 '17 at 13:36
  • I tried putting it in an NSTimer triggered after 2 s, but I still have the same issue. It worked once and I got the issue again on the next run. I modified didFinishSpeechUtterance and put the statements below in an NSTimer, firing after 2 s: [[AVAudioSession sharedInstance] setActive:NO withOptions:0 error:nil]; [[AVAudioSession sharedInstance] setCategory:AVAudioSessionCategoryAmbient withOptions:0 error:nil]; [[AVAudioSession sharedInstance] setActive:YES withOptions:0 error:nil]; – csharpnewbie Apr 25 '17 at 17:13
  • After the speech-to-text process starts, I see the 'Step 7' log and after that see the error - [avas] ERROR: AVAudioSession.mm:1049: -[AVAudioSession setActive:withOptions:error:]: Deactivating an audio session that has running I/O. All I/O should be stopped or paused prior to deactivating the audio session. After some time the speech recognition stops by itself, and I see the error below as well. Error Domain=kAFAssistantErrorDomain Code=203 "Corrupt" UserInfo={NSUnderlyingError=0x14651450 {Error Domain=SiriSpeechErrorDomain Code=102 "(null)"}, NSLocalizedDescription=Corrupt} – csharpnewbie Apr 25 '17 at 17:36
  • You had mentioned to set the delay time as ioBufferDuration * 1000 (since it is a duration in seconds). I printed [[AVAudioSession sharedInstance] IOBufferDuration] and it comes out as 0.023220, which is only 23 ms. It looks like I need to set a very small delay then? Yet even after using a delay of 2 s with NSTimer, I am still running into the same problem. Please help. – csharpnewbie Apr 25 '17 at 18:56
  • On which line of code are you getting the I/O error? And when do you call the `recordUserSpeech` method? I like tricky `AudioSession` tasks, so I'm going to do my best :) – Asya Apr 26 '17 at 05:28
  • From the UI, I call the utterance to speak out my welcome message, and upon receiving the delegate response back from didFinishSpeechUtterance, I call recordUserSpeech. As I mentioned, it worked once yesterday as well, but did not work at all after that, across several attempts. – csharpnewbie Apr 26 '17 at 12:26
  • I am not able to nail down which line of code is giving the error, since we initialize the audio session even for speech-to-text. But as mentioned, all my log lines print fine in the recognizeText method and then this error comes. So I am guessing it could be [self.audioEngine startAndReturnError:&err]; in recognizeText, but the returned error is null too. As mentioned, no other audio is playing either. – csharpnewbie Apr 26 '17 at 18:15
  • My first guess was wrong. I've implemented a small project based on your code. Nothing special with `speechSynthesizer:didFinishSpeechUtterance:`, but I noticed you release the audioEngine in the `-stopAndRelease` method, which is called in the `if (isFinal)` scope, which in turn runs about a minute after speech ends (look closely at the log). So you get the `AudioSession` error on the second call to `-speak:`. Be sure `-stopAndRelease` is called **before** you modify the `AudioSession` again. Maybe `SFSpeechRecognitionTaskDelegate` gives you a better approach. – Asya Apr 27 '17 at 09:02
  • But it seems like Apple's Speech.framework does not have very good VAD (voice activity detection). – Asya Apr 27 '17 at 09:02
  • Some notes about your code: in your first `AudioSession` activation, you forgot to call `setActive:YES` after setting the category. Using `AVAudioSessionModeMeasurement` causes the second synthesized speech to play at very low output volume (very quiet); are you sure this mode is required? Don't call `setActive:YES withOptions:error:`; the options only make sense for `setActive:NO withOptions:AVAudioSessionSetActiveOptionNotifyOthersOnDeactivation error:`. Use the `AVAudioSessionCategoryOptionDefaultToSpeaker` option to be sure sound plays through the speaker output (see the sketch after these comments). – Asya Apr 27 '17 at 09:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/142830/discussion-between-csharpnewbie-and-asya). – csharpnewbie Apr 27 '17 at 12:58
  • @Asya Do you have any idea about the question I have asked: https://stackoverflow.com/questions/44762541/can-we-use-speech-framework-while-videoplayback-is-going, whether it's possible or not? Any suggestions will be appreciated. – The iCoder Jun 26 '17 at 15:20
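A minimal sketch of the session setup those notes point at, assuming the PlayAndRecord category is acceptable for both the synthesis and recognition phases (DefaultToSpeaker is only valid with PlayAndRecord):

NSError *error = nil;
AVAudioSession *session = [AVAudioSession sharedInstance];
// Configure first, then activate; DefaultToSpeaker routes output to the speaker.
[session setCategory:AVAudioSessionCategoryPlayAndRecord
         withOptions:AVAudioSessionCategoryOptionDefaultToSpeaker
               error:&error];
[session setActive:YES error:&error]; // plain setActive:error: is enough here
if (error) {
    NSLog(@"AudioSession setup failed: %@", error);
}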