
Problem

I am looking to extract sample-accurate ranges of LPCM audio from the audio tracks within video files. Currently I'm trying to achieve this using AVAssetReaderTrackOutput against an AVAssetTrack delivered from reading an AVURLAsset.

Despite initializing the asset with AVURLAssetPreferPreciseDurationAndTimingKey set to YES, seeking to a sample-accurate position within the asset appears to be inaccurate.

NSDictionary *options = @{ AVURLAssetPreferPreciseDurationAndTimingKey : @(YES) };
_asset = [[AVURLAsset alloc] initWithURL:fileURL options:options];

This manifests itself with, for example, variable-bit-rate (VBR) encoded AAC streams. While I know that accurate seeking in VBR audio streams carries a performance overhead, I'm willing to pay that cost provided I'm delivered accurate samples.

When using, e.g., Extended Audio File Services and the ExtAudioFileRef API, I can achieve sample-accurate seeks and extraction of audio. Likewise with AVAudioFile, as it builds on top of ExtAudioFileRef.
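For reference, a minimal sketch of the ExtAudioFile sequence that does read sample-accurately for me (the stereo 32-bit float client format matches the output settings used below; fileURL, startFrame, frameCount and _sampleRate reuse the names from the rest of this question, and error handling is elided):

ExtAudioFileRef file = NULL;
ExtAudioFileOpenURL( (CFURLRef)fileURL, &file );

AudioStreamBasicDescription client = { 0 };
client.mSampleRate       = _sampleRate;
client.mFormatID         = kAudioFormatLinearPCM;
client.mFormatFlags      = kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked;
client.mBitsPerChannel   = 32;
client.mChannelsPerFrame = 2;
client.mBytesPerFrame    = client.mChannelsPerFrame * sizeof( Float32 );
client.mFramesPerPacket  = 1;
client.mBytesPerPacket   = client.mBytesPerFrame;
ExtAudioFileSetProperty( file, kExtAudioFileProperty_ClientDataFormat, sizeof( client ), &client );

// Offsets are expressed in client-format frames, so this lands exactly on startFrame.
ExtAudioFileSeek( file, startFrame );

NSMutableData *pcm = [NSMutableData dataWithLength:frameCount * client.mBytesPerFrame];
AudioBufferList list;
list.mNumberBuffers = 1;
list.mBuffers[0].mNumberChannels = client.mChannelsPerFrame;
list.mBuffers[0].mDataByteSize   = (UInt32)pcm.length;
list.mBuffers[0].mData           = pcm.mutableBytes;

UInt32 frames = frameCount;
ExtAudioFileRead( file, &frames, &list );
ExtAudioFileDispose( file );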

The issue, however, is that I would also like to extract audio from media containers that the audio-file-only APIs reject, but which AVFoundation supports via AVURLAsset.

Method

A sample-accurate time range for extraction is defined using CMTime and CMTimeRange, and set on the AVAssetReader. Samples are then extracted iteratively.

-(NSData *)readFromFrame:(SInt64)startFrame
      requestedFrameCount:(UInt32)frameCount
{
    NSUInteger expectedByteCount = frameCount * _bytesPerFrame;
    NSMutableData *data = [NSMutableData dataWithCapacity:expectedByteCount];
    
    //
    // Configure Output
    //

    NSDictionary *settings = @{ AVFormatIDKey               : @( kAudioFormatLinearPCM ),
                                AVLinearPCMIsNonInterleaved : @( NO ),
                                AVLinearPCMIsBigEndianKey   : @( NO ),
                                AVLinearPCMIsFloatKey       : @( YES ),
                                AVLinearPCMBitDepthKey      : @( 32 ),
                                AVNumberOfChannelsKey       : @( 2 ) };

    AVAssetReaderOutput *output = [[AVAssetReaderTrackOutput alloc] initWithTrack:_track outputSettings:settings];

    // With the timescale set to the sample rate, CMTime values count audio frames.
    CMTime startTime    = CMTimeMake( startFrame, _sampleRate );
    CMTime durationTime = CMTimeMake( frameCount, _sampleRate );
    CMTimeRange range   = CMTimeRangeMake( startTime, durationTime );

    //
    // Configure Reader
    //

    NSError *error = nil;
    AVAssetReader *reader = [[AVAssetReader alloc] initWithAsset:_asset error:&error];

    if( !reader )
    {
        fprintf( stderr, "avf : failed to initialize reader\n" );
        fprintf( stderr, "avf : %s\n%s\n", error.localizedDescription.UTF8String, error.localizedFailureReason.UTF8String );
        exit( EXIT_FAILURE );
    }

    [reader addOutput:output];
    [reader setTimeRange:range];
    BOOL startOK = [reader startReading];

    NSAssert( startOK && reader.status == AVAssetReaderStatusReading, @"Ensure we've started reading." );

    NSAssert( _asset.providesPreciseDurationAndTiming, @"We expect the asset to provide accurate timing." );

    //
    // Start reading samples
    //

    CMSampleBufferRef sample = NULL;
    while(( sample = [output copyNextSampleBuffer] ))
    {
        CMTime presentationTime = CMSampleBufferGetPresentationTimeStamp( sample );
        if( data.length == 0 )
        {
            // First read - we should be at the expected presentation time requested.
            int32_t comparisonResult = CMTimeCompare( presentationTime, startTime );
            NSAssert( comparisonResult == 0, @"We expect sample accurate seeking" );
        }

        CMBlockBufferRef buffer = CMSampleBufferGetDataBuffer( sample );

        if( !buffer )
        {
            fprintf( stderr, "avf : failed to obtain buffer" );
            exit( EXIT_FAILURE );
        }

        size_t lengthAtOffset = 0;
        size_t totalLength = 0;
        char *bufferData = NULL;

        if( CMBlockBufferGetDataPointer( buffer, 0, &lengthAtOffset, &totalLength, &bufferData ) != kCMBlockBufferNoErr )
        {
            fprintf( stderr, "avf : failed to get sample\n" );
            exit( EXIT_FAILURE );
        }

        if( bufferData && totalLength )
        {
            if( lengthAtOffset == totalLength )
            {
                // Contiguous block buffer - append directly.
                [data appendBytes:bufferData length:totalLength];
            }
            else
            {
                // Non-contiguous block buffer - lengthAtOffset only covers the
                // first contiguous run, so copy the whole buffer out instead.
                NSUInteger offset = data.length;
                [data increaseLengthBy:totalLength];
                CMBlockBufferCopyDataBytes( buffer, 0, totalLength, (char *)data.mutableBytes + offset );
            }
        }

        CFRelease( sample );
    }

    NSAssert( reader.status == AVAssetReaderStatusCompleted, @"Completed reading" );

    [output release];
    [reader release];

    return [NSData dataWithData:data];
}

Notes

The presentation time that CMSampleBufferGetPresentationTimeStamp returns matches the time I sought to. But since the seek itself is inaccurate, I have no way to measure the offset and correct or align the samples I retrieve.

Any thoughts on how to do this?

Alternatively, is there a way to adapt an AVAssetTrack for use with AVAudioFile or ExtAudioFile?

Is it possible to access the audio track via AudioFileOpenWithCallbacks?

Is it possible to get at the audio stream of a video container in some other manner on macOS?

    It should be noted that at times AVFoundation delivers fewer samples than are needed to satisfy the `durationTime` requirement. It isn't a problem to use, e.g., a `durationTime` of `kCMTimePositiveInfinity` and simply read as many samples as required (see the sketch below) … it's the initial seek that is problematic. – Dan Nov 06 '17 at 06:33
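A sketch of that open-ended read, reusing startTime, expectedByteCount, reader and output from the method above:

CMTimeRange range = CMTimeRangeMake( startTime, kCMTimePositiveInfinity );
[reader setTimeRange:range];

// ...copyNextSampleBuffer loop as before, but stop once enough bytes have arrived:
if( data.length >= expectedByteCount )
{
    CFRelease( sample );
    [reader cancelReading];
    break;
}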

2 Answers


One procedure that works is to use AVAssetReader to read your compressed AV file, in conjunction with AVAssetWriter, to write a new raw LPCM file of the audio samples. One can then index quickly through this new PCM file (or a memory-mapped array, if necessary) to extract exact sample-accurate ranges, without incurring VBR per-packet decoding-size anomalies or depending on iOS CMTimeStamp algorithms outside one's control.

This may not be the most time- or memory-efficient procedure, but it works. A rough sketch follows.
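This sketch borrows the _asset and _track names from the question; pcmURL (the destination for the .caf dump), the sample rate, and the added AVSampleRateKey, which writer inputs require, are assumptions, and error handling is elided:

NSDictionary *pcmSettings = @{ AVFormatIDKey               : @( kAudioFormatLinearPCM ),
                               AVSampleRateKey             : @( 44100.0 ), // assumed; match the source track
                               AVNumberOfChannelsKey       : @( 2 ),
                               AVLinearPCMBitDepthKey      : @( 32 ),
                               AVLinearPCMIsFloatKey       : @( YES ),
                               AVLinearPCMIsBigEndianKey   : @( NO ),
                               AVLinearPCMIsNonInterleaved : @( NO ) };

AVAssetReader *reader = [[AVAssetReader alloc] initWithAsset:_asset error:NULL];
AVAssetReaderTrackOutput *output = [[AVAssetReaderTrackOutput alloc] initWithTrack:_track outputSettings:pcmSettings];
[reader addOutput:output];

AVAssetWriter *writer = [[AVAssetWriter alloc] initWithURL:pcmURL fileType:AVFileTypeCoreAudioFormat error:NULL];
AVAssetWriterInput *input = [[AVAssetWriterInput alloc] initWithMediaType:AVMediaTypeAudio outputSettings:pcmSettings];
[writer addInput:input];

[reader startReading];
[writer startWriting];
[writer startSessionAtSourceTime:kCMTimeZero];

dispatch_queue_t queue = dispatch_queue_create( "pcm.dump", NULL );
[input requestMediaDataWhenReadyOnQueue:queue usingBlock:^{
    while( input.readyForMoreMediaData )
    {
        CMSampleBufferRef sample = [output copyNextSampleBuffer];
        if( !sample )
        {
            // End of track (or a reader error - check reader.status here).
            [input markAsFinished];
            [writer finishWritingWithCompletionHandler:^{ /* done */ }];
            break;
        }
        [input appendSampleBuffer:sample];
        CFRelease( sample );
    }
}];

Once written, the file is pure LPCM, so frame n lives at a fixed byte offset and exact ranges can be pulled out with ExtAudioFileSeek/ExtAudioFileRead or a memory map.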

hotpaw2
    It would definitely work; however, I would really like to avoid a complete intermediate dump of the entire source audio track to memory/disk. Using, e.g., `AVAssetExportSession` to write the audio track to disk without re-encoding (pass-through), and then reading that with the audio-file-only APIs, works, but it is a costly step. – Dan Nov 06 '17 at 19:51

I wrote another answer in which I incorrectly claimed that AVAssetReader/AVAssetReaderTrackOutput do not do sample-accurate seeking. They do, but seeking looks broken when your audio track is embedded inside a movie file, so you've found a bug. Congratulations!

The audio track dumped with a pass-through AVAssetExportSession, as mentioned in a comment on @hotpaw2's answer, works fine, even when you seek to non-packet boundaries (you happened to be seeking on packet boundaries; the linked file has 1024 frames per packet). When seeking off packet boundaries, your diffs are no longer zero, but they are very, very small and inaudible.

I didn't find a workaround, so reconsider dumping the compressed track. Is it really that costly? If you'd rather not do that, you can decode the raw packets yourself by passing nil for outputSettings: on your AVAssetReaderOutput and running its output through an AudioQueue or (preferably?) an AudioConverter to get LPCM, as sketched below.

NB: in this latter case, you will need to handle rounding up to packet boundaries when seeking.
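A rough sketch of that pass-through route, reusing _track from the question (the 32-bit float output format is an assumption, and the AudioConverter input callback that feeds it packets from copyNextSampleBuffer is elided):

// nil outputSettings => the reader vends the track's compressed packets untouched.
AVAssetReaderTrackOutput *raw = [[AVAssetReaderTrackOutput alloc] initWithTrack:_track outputSettings:nil];

CMAudioFormatDescriptionRef desc = (CMAudioFormatDescriptionRef)_track.formatDescriptions.firstObject;
const AudioStreamBasicDescription *src = CMAudioFormatDescriptionGetStreamBasicDescription( desc );

AudioStreamBasicDescription dst = { 0 };
dst.mSampleRate       = src->mSampleRate;
dst.mFormatID         = kAudioFormatLinearPCM;
dst.mFormatFlags      = kAudioFormatFlagIsFloat | kAudioFormatFlagIsPacked;
dst.mBitsPerChannel   = 32;
dst.mChannelsPerFrame = src->mChannelsPerFrame;
dst.mBytesPerFrame    = dst.mChannelsPerFrame * sizeof( Float32 );
dst.mFramesPerPacket  = 1;
dst.mBytesPerPacket   = dst.mBytesPerFrame;

AudioConverterRef converter = NULL;
AudioConverterNew( src, &dst, &converter );

// Round the requested start down to a packet boundary (1024 frames for AAC),
// build the CMTimeRange from that, then discard (startFrame % 1024) frames
// from the converter's decoded output to land on the exact sample.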

Rhythmic Fistman