
I am trying to decode a raw H264 stream using VideoToolbox APIs in Swift (macOS).

In viewDidLoad() I set up my display layer and CMTimebase like so:

self.view.wantsLayer = true

self.VideoLayer = AVSampleBufferDisplayLayer()
self.VideoLayer.frame = self.view.bounds
self.view.layer?.addSublayer(self.VideoLayer)

var _CMTimebasePointer: CMTimebase? = nil
let status = CMTimebaseCreateWithMasterClock(
    allocator: kCFAllocatorDefault,
    masterClock: CMClockGetHostTimeClock(),
    timebaseOut: &_CMTimebasePointer)

self.VideoLayer.controlTimebase = _CMTimebasePointer
CMTimebaseSetTime(
    self.VideoLayer.controlTimebase!,
    time: CMTime.zero);
CMTimebaseSetRate(
    self.VideoLayer.controlTimebase!,
    rate: 1.0);

Then I read my H264 file as raw bytes and parse it into separate NALUs. (I cross-checked against NALU parsers in other projects and my NALU parser is correct, but if you think I should post its code here, leave a comment and I'll edit my question :) )

This is how I process each NALU: I basically write the NALU length into the first 4 bytes (to convert to avcC format), and for SPS & PPS NALUs I ignore the first 4 bytes:

func decodeFrame(_ videoPacket: inout VideoPacket)
{
    // replace start code with nal size
    var biglen = CFSwapInt32HostToBig(UInt32(videoPacket.count - 4)) // NALU length doesn't contain the first 4 size bytes
    memcpy(&videoPacket, &biglen, 4)
    let nalType = videoPacket[4] & 0x1F
    switch nalType
    {
        case 0x05:
//                print("Nal type is IDR frame")
            // inside this I create the format description and decompression session
            createDecompressionSession()
            decodeVideoPacket(videoPacket)
        case 0x07:
//                print("Nal type is SPS")
            spsSize = videoPacket.count - 4
            sps = Array(videoPacket[4..<videoPacket.count])
        case 0x08:
//                print("Nal type is PPS")
            ppsSize = videoPacket.count - 4
            pps = Array(videoPacket[4..<videoPacket.count])
        default:
//                print("Nal type is B/P frame: \(nalType)")
            decodeVideoPacket(videoPacket)
            break;
    }
}

I then create the VideoFormatDescription like so:

let pointerSPS = UnsafePointer<UInt8>(spsData)
let pointerPPS = UnsafePointer<UInt8>(ppsData)

// make pointers array
let dataParamArray = [pointerSPS, pointerPPS]
let parameterSetPointers = UnsafePointer<UnsafePointer<UInt8>>(dataParamArray)

// make parameter sizes array
let sizeParamArray = [spsData.count, ppsData.count]
let parameterSetSizes = UnsafePointer<Int>(sizeParamArray)

let status = CMVideoFormatDescriptionCreateFromH264ParameterSets(
    allocator: kCFAllocatorDefault,
    parameterSetCount: 2,
    parameterSetPointers: parameterSetPointers,
    parameterSetSizes: parameterSetSizes,
    nalUnitHeaderLength: 4,
    formatDescriptionOut: &self.VideoFormatDescription) // class variable

And I make the VTDecompressionSession like so:

let decoderParameters = NSMutableDictionary()
let destinationPixelBufferAttributes = NSMutableDictionary()
destinationPixelBufferAttributes.setValue(
    NSNumber(value: kCVPixelFormatType_32ARGB), // I've tried various values here to no avail...
    forKey: kCVPixelBufferPixelFormatTypeKey as String
)

var outputCallback = VTDecompressionOutputCallbackRecord()
outputCallback.decompressionOutputCallback = decompressionSessionDecodeFrameCallback
outputCallback.decompressionOutputRefCon = UnsafeMutableRawPointer(Unmanaged.passUnretained(self).toOpaque())

let status = VTDecompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    formatDescription: videoDescription,
    decoderSpecification: decoderParameters,
    imageBufferAttributes: destinationPixelBufferAttributes,
    outputCallback: &outputCallback,
    decompressionSessionOut: &self.DecompressionSession)

Then, this is how I decode each frame:

func decodeVideoPacket(_ videoPacket: VideoPacket)
{
    let bufferPointer = UnsafeMutablePointer<UInt8>(mutating: videoPacket)
    var blockBuffer: CMBlockBuffer?
    var status = CMBlockBufferCreateWithMemoryBlock(
        allocator: kCFAllocatorDefault,
        memoryBlock: bufferPointer,
        blockLength: videoPacket.count,
        blockAllocator: kCFAllocatorNull,
        customBlockSource: nil,
        offsetToData: 0,
        dataLength: videoPacket.count,
        flags: 0,
        blockBufferOut: &blockBuffer)
    if status != noErr
    {
        print("CMBlockBufferCreateWithMemoryBlock ERROR: \(status)")
        return
    }
    
    var sampleBuffer: CMSampleBuffer?
    let sampleSizeArray = [videoPacket.count]
    
    let frameFPS = Double(1) / Double(60)
    let tval = Double(frameFPS * Double(self.frameCount))
    let presentationTime = CMTimeMakeWithSeconds(tval, preferredTimescale: 1000)
    var info = CMSampleTimingInfo(
        duration: CMTimeMakeWithSeconds(frameFPS, preferredTimescale: 1000),
        presentationTimeStamp: presentationTime,
        decodeTimeStamp: presentationTime)
    self.frameCount += 1
    
    status = CMSampleBufferCreateReady(
        allocator: kCFAllocatorDefault,
        dataBuffer: blockBuffer,
        formatDescription: self.VideoFormatDescription,
        sampleCount: 1,
        sampleTimingEntryCount: 1,
        sampleTimingArray: &info,
        sampleSizeEntryCount: 1,
        sampleSizeArray: sampleSizeArray,
        sampleBufferOut: &sampleBuffer)
    if status != noErr
    {
        print("CMSampleBufferCreateReady ERROR: \(status)")
        return
    }
    
    guard let buffer = sampleBuffer
    else
    {
        print("Could not unwrap sampleBuffer!")
        return
    }
    
    if self.VideoLayer.isReadyForMoreMediaData
    {
        self.VideoLayer?.enqueue(buffer)
        self.VideoLayer.displayIfNeeded()
    }
    
    
    if let session = self.DecompressionSession
    {
        var outputBuffer: CVPixelBuffer?

        status = VTDecompressionSessionDecodeFrame(
            session,
            sampleBuffer: buffer,
            flags: [],
            frameRefcon: &outputBuffer,
            infoFlagsOut: nil)
        if status != noErr
        {
            print("VTDecompressionSessionDecodeFrame ERROR: \(status)")
        }

        status = VTDecompressionSessionWaitForAsynchronousFrames(session)
        if status != noErr
        {
            print("VTDecompressionSessionWaitForAsynchronousFrames ERROR: \(status)")
        }
    }
}

Lastly, in the decode callback function, I currently just check whether imageBuffer is nil, but it is always nil while the OSStatus is always noErr:

private func decompressionSessionDecodeFrameCallback(
    _ decompressionOutputRefCon: UnsafeMutableRawPointer?,
    _ sourceFrameRefCon: UnsafeMutableRawPointer?,
    _ status: OSStatus,
    _ infoFlags: VTDecodeInfoFlags,
    _ imageBuffer: CVImageBuffer?,
    _ presentationTimeStamp: CMTime,
    _ presentationDuration: CMTime) -> Void
{
    print("status: \(status), image_nil?: \(imageBuffer == nil)")
}

Clearly, since the imageBuffer is always nil, I assume something is wrong...

(Also, the AVSampleBufferDisplayLayer doesn't render any image.)

Can you please help me find what is wrong with my code, or tell me how to dig deeper into whatever VTDecompression error might be happening but is hidden from me?

PS: let me know if anything in my code needs more explanation.

user3339439
  • Hey, welcome to the site! This is a *very* well written question, I commend you! – Alexander Mar 29 '21 at 14:39
  • You need to pass **all NALUs** you receive to the H.264 decoder, including the SPS and PPS NALUs. (You also put the SPS and PPS into the decoder context, as you are doing.) And it's possible for a frame of video to be coded in multiple NALUs. I think, but I'm not sure, your code assumes each VideoPacket contains just one NALU, so you only do the delimiter-to-length (Annex B-to-avcC) conversion on the first one. And I agree with @Alexander. Welcome! I hope to see more contributions from you here. – O. Jones Mar 30 '21 at 14:01

2 Answers


I have some suggestions that may help you (moved here from the comments to make a full answer).

  1. There is an outputCallback closure which also receives a status: OSStatus; you can check for errors there as well:
/// This step is not necessary, because I'm using a sample buffer layer to display it.
/// This callback gives you a `CVPixelBuffer` if you want to manage displaying yourself.
private var outputCallback: VTDecompressionOutputCallback = {
    (decompressionOutputRefCon: UnsafeMutableRawPointer?,
    sourceFrameRefCon: UnsafeMutableRawPointer?, status: OSStatus,
    infoFlags: VTDecodeInfoFlags, imageBuffer: CVPixelBuffer?,
    presentationTimeStamp: CMTime, duration: CMTime) in
    
    // recover `self` from the refCon that was passed to VTDecompressionSessionCreate
    let selfPointer = Unmanaged<VideoStreamManager>.fromOpaque(decompressionOutputRefCon!).takeUnretainedValue()
    if status == noErr {
        debugPrint("===== ✅ Image successfully decompressed, OSStatus: \(status) =====")
    } else {
        debugPrint("===== ❌ Failed to decompress, OSStatus: \(status) =====")
    }
}
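If the status is not noErr, it also helps to print a readable name for the code. A small sketch (the helper name is mine; the numeric values are the ones listed in VideoToolbox's VTErrors.h):

import VideoToolbox

// Hypothetical helper: names a few common VideoToolbox decoder error codes.
func describeVTStatus(_ status: OSStatus) -> String {
    switch status {
    case 0:       return "noErr"
    case -12903:  return "kVTInvalidSessionErr"
    case -12909:  return "kVTVideoDecoderBadDataErr"
    case -12910:  return "kVTVideoDecoderUnsupportedDataFormatErr"
    case -12911:  return "kVTVideoDecoderMalfunctionErr"
    default:      return "OSStatus \(status)"
    }
}

You could call describeVTStatus(status) in the failure branch above instead of printing only the raw number.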
  2. The start code before a NALU is not always 00 00 00 01 (4 bytes); it can also be 00 00 01 (3 bytes), but you always subscript byte [4] (see the sketch below).

The Annex B specification solves this by requiring ‘Start Codes’ to precede each NALU. A start code is 2 or 3 0x00 bytes followed with a 0x01 byte. e.g. 0x000001 or 0x00000001.


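To illustrate point 2, here is a minimal sketch of an Annex B splitter that accepts both 3-byte and 4-byte start codes (names are mine, not the OP's parser):

// Split an Annex B stream into NALU payloads (start codes removed).
// Accepts both 00 00 01 and 00 00 00 01 start codes.
func splitAnnexB(_ stream: [UInt8]) -> [[UInt8]] {
    // Each entry: (index where the start code begins, index of the first payload byte)
    var marks: [(codeStart: Int, payloadStart: Int)] = []
    var i = 0
    while i + 2 < stream.count {
        if stream[i] == 0x00, stream[i + 1] == 0x00, stream[i + 2] == 0x01 {
            // 3-byte start code; if it is preceded by another 0x00, treat it as a 4-byte one
            let codeStart = (i > 0 && stream[i - 1] == 0x00) ? i - 1 : i
            marks.append((codeStart: codeStart, payloadStart: i + 3))
            i += 3
        } else {
            i += 1
        }
    }
    var nalus: [[UInt8]] = []
    for (n, mark) in marks.enumerated() {
        // a NALU's payload ends where the next start code begins
        let end = (n + 1 < marks.count) ? marks[n + 1].codeStart : stream.count
        nalus.append(Array(stream[mark.payloadStart..<end]))
    }
    return nalus
}

Each returned element is a NALU payload without its start code, ready for the length-prefix (avcC) conversion.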
Let me know if this helps you.

vpoltave
  • Hi, thanks for your suggestions, for #1, I am checking in the outputCallback if OSStatus is noErr and it is always noErr (it's the last code snippet in my question), any other place I could check?. For #2, my NAL parser checks for both `00 00 01` and `00 00 00 01` start codes and 'normalizes' the `00 00 01` by adding extra **00** in front so every NALU starts with `00 00 00 01` (for easier processing afterwards). – user3339439 Mar 29 '21 at 10:01
  • Sorry, indeed you have `outputCallback` just didn't recognize it since you have a function and I have a property with a different name. – vpoltave Mar 29 '21 at 12:20
  • Do you successfully enter this `if self.VideoLayer.isReadyForMoreMediaData` if statement? – vpoltave Mar 29 '21 at 12:21
  • Yeah I do, I've even queued sample buffers into a local array inside my class from decode method, and then in `viewDidLoad`, called requestMediaDataWhenReady to feed those CMSampleBuffers into the display layer. They do get fed into the display layer, but I can't see any picture. https://developer.apple.com/documentation/avfoundation/avsamplebufferdisplaylayer/1387778-requestmediadatawhenready – user3339439 Mar 29 '21 at 20:07
  • Do you have a working example of this code on you that you could maybe share? – user3339439 Mar 29 '21 at 20:07
  • Hey, small update, I got it working! I'll try to post my solution soon. – user3339439 Mar 30 '21 at 09:24
  • @user3339439 nice, good for you. Will wait for update to see what was wrong – vpoltave Mar 30 '21 at 12:27

My problem was that, although I was parsing each NALU correctly and converting each NALU to AVCC format to feed into the AVSampleBufferDisplayLayer / VTDecompressor, each NALU was not an entire video frame. I stumbled on a random thread somewhere (can't find it now) that described taking all the NALUs that make up one video frame and combining them into one big NALU.

This looks as follows:

NALU_length_header = 4-byte big-endian NALU length value

NALU = the rest of the NALU's data bytes (containing the NALU slice_header and video frame data, I think)

Each NALU looks like: [NALU_length_header][NALU]

So when we combine multiple NALUs to make up one frame, it should look like: [NALU_length_header_1][NALU_1][NALU_length_header_2][NALU_2][NALU_length_header_3][NALU_3][NALU_length_header_4][NALU_4]

In my case, four NALUs made up one full video frame.

Once you have the NALUs combined, probably in some [UInt8] array, that data can be used to create a CMBlockBuffer and then a CMSampleBuffer, which is passed to the decoder / video layer.
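As a rough sketch of that combining step (the helper name is mine; it assumes each NALU arrives as a plain [UInt8] payload without a start code):

// Combine the NALUs of one video frame into a single AVCC-formatted sample:
// every NALU is prefixed with its length as a 4-byte big-endian integer.
func makeAVCCSample(from nalus: [[UInt8]]) -> [UInt8] {
    var sample: [UInt8] = []
    for nalu in nalus {
        withUnsafeBytes(of: UInt32(nalu.count).bigEndian) { lengthBytes in
            sample.append(contentsOf: lengthBytes)
        }
        sample.append(contentsOf: nalu)
    }
    return sample
}

The resulting array is what then goes into CMBlockBufferCreateWithMemoryBlock and CMSampleBufferCreateReady as one sample.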

There are two ways I found to detect which NALUs belong together in one video frame. Both involve looking at NALU slice header properties.

First, you can look at the property called frame_num: if any NALUs have the same frame_num value, combine their data into one 'big' NALU. (My encoder doesn't set this value, so I had to use the first_mb_in_slice value.)

Second, read the property called first_mb_in_slice. This property incremented as 0, 2040, 4080, 6120 over the span of four NALUs; it is the macroblock offset of the slice within the frame, so we can use it to detect which NALUs make up one video frame.
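For illustration, here is a minimal sketch of reading first_mb_in_slice (the helper name is mine; it assumes the NALU payload begins with the 1-byte NAL header and, for simplicity, does not strip 00 00 03 emulation-prevention bytes, which is usually harmless for the first few header bytes):

// first_mb_in_slice is the first syntax element of the slice header,
// coded as an unsigned Exp-Golomb (ue(v)) value right after the NAL header byte.
func firstMbInSlice(of nalu: [UInt8]) -> Int? {
    guard nalu.count > 1 else { return nil }
    var bitOffset = 8                          // skip the 1-byte NAL header
    func readBit() -> Int? {
        let byteIndex = bitOffset >> 3
        guard byteIndex < nalu.count else { return nil }
        let bit = 7 - (bitOffset & 7)
        bitOffset += 1
        return Int((nalu[byteIndex] >> bit) & 1)
    }
    // ue(v): count leading zero bits up to the first 1, then read that many more bits
    var leadingZeros = 0
    while true {
        guard let b = readBit() else { return nil }
        if b == 1 { break }
        leadingZeros += 1
    }
    var suffix = 0
    for _ in 0..<leadingZeros {
        guard let b = readBit() else { return nil }
        suffix = (suffix << 1) | b
    }
    return (1 << leadingZeros) - 1 + suffix
}

With values like the ones above, a slice whose first_mb_in_slice reads back as 0 marks the start of the next frame, so that is a convenient point to flush the previous frame's NALUs to the decoder.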

PS: sorry if my answer is a little wordy or confusing; I hope it helps!

user3339439
  • Thank you, I've been working on this for days until I finally found your answer. Simply appending all the NALUs together with the length headers included is so simple that it didn't even occur to me. I'm just combining NALUs that have the same pts... I wonder if that's safe or if I need to look at the properties you're referring to. – Ken Aspeslagh Jan 21 '23 at 00:54
  • FWIW, going by the timestamp did not work completely right. For some reason some of the NALs of the same image had different timestamps! – Ken Aspeslagh Jan 24 '23 at 03:42