How to extract SPS and PPS from RTMP stream (avc1 encoded)?

Question

I'm working on an extension to Node Media Server to save incoming streams to disk as MP4. For this conversion I'm leaning heavily on the Apple QuickTime Movie Specification, the ISO/IEC 14496-14 specification (which I found for free in the Rust-MP4 GitHub repository), and the HLS.js source code.

I'm testing with a single video at the moment. Once this works I'll start experimenting with other videos. For my use case I only need to support H.264 video and AAC audio.

Currently when an RTMP connection is established, the first 3 packets I receive are consistently:

1 AMF metadata packet (RTMP cid = 6) containing information like video width, video height, bitrate, audio sample rate, etc.
1 audio packet (RTMP cid = 4) containing 7 bytes of data. I assume this is the AAC config packet
1 video packet (RTMP cid = 5) containing 46 bytes of data. I assume this is the AVC config packet

When writing the MP4 moov atom, there are two places where I need to utilize additional information not located in the AMF metadata (and presumably located in these two config packets):

In the esds atom, the HLS.js source appends "config" data. I assume I just append the entire 7-byte payload from the audio config packet here
In the avcC atom, the HLS.js source appends the "sps" and "pps" data. This is the root of my issue
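
For reference, the body of an avcC atom is an AVCDecoderConfigurationRecord: a short fixed header followed by length-prefixed SPS and PPS entries. The sketch below is a minimal illustration, assuming exactly one SPS and one PPS, 4-byte NAL length prefixes, and a profile that needs no trailing extension fields. `buildAvcCBody` is a hypothetical helper name, it does not write the 8-byte atom header (size plus `avcC`), and the short byte arrays in the usage line are placeholders rather than real parameter sets.

```javascript
// Sketch: assemble the body of an avcC atom (an AVCDecoderConfigurationRecord)
// from one SPS and one PPS NAL unit. buildAvcCBody is a hypothetical helper.
function buildAvcCBody(sps, pps) {
  // sps[0] is the NAL header byte; profile, compat and level follow it.
  const header = Buffer.from([
    0x01,     // configurationVersion
    sps[1],   // AVCProfileIndication
    sps[2],   // profile_compatibility
    sps[3],   // AVCLevelIndication
    0xfc | 3, // 6 reserved bits + lengthSizeMinusOne (assumes 4-byte NAL lengths)
    0xe0 | 1, // 3 reserved bits + numOfSequenceParameterSets (assumes 1)
  ]);

  const spsLength = Buffer.alloc(2);
  spsLength.writeUInt16BE(sps.length, 0);

  const ppsHeader = Buffer.alloc(3);
  ppsHeader.writeUInt8(1, 0);             // numOfPictureParameterSets (assumes 1)
  ppsHeader.writeUInt16BE(pps.length, 1); // length of the single PPS

  return Buffer.concat([
    header,
    spsLength,
    Buffer.from(sps),
    ppsHeader,
    Buffer.from(pps),
  ]);
}

// Placeholder inputs (truncated, not valid parameter sets), for shape only.
const exampleBody = buildAvcCBody([0x67, 0x42, 0xc0, 0x1f], [0x68, 0xc8]);
console.log(exampleBody.toString("hex"));
```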

Regarding the parsing of these 46 bytes, I found code in Node Media Server and HLS.js that seems to parse the same data. The difference between these two pieces of code is that Node Media Server expects an additional 13 bytes of data at the start of the packet. The packet I receive seems to contain these additional 13 bytes, so I simply follow their lead in extracting width, height, profile, compat, and level information. The 46 bytes in particular are:

[0x17, 0x00, 0x00, 0x00, 0x00, 0x01, 0x42, 0xc0, 0x1f, 0xff, 0xe1, 0x00, 0x19, 0x67, 0x42, 0xc0, 0x1f, 0xa6, 0x11, 0x02, 0x80, 0xbf, 0xe5, 0x84, 0x00, 0x00, 0x03, 0x00, 0x04, 0x00, 0x00, 0x03, 0x00, 0xc2, 0x3c, 0x60, 0xc8, 0x46, 0x01, 0x00, 0x05, 0x68, 0xc8, 0x42, 0x32, 0xc8]

Breaking this down for the bytes I can easily parse (prior to the use of Exponential Golomb encoding):

[
    0x17, // "frame type", specifies H.264 or HEVC
    0x00, 0x00, 0x00, 0x00, 0x01, // ignored. Reserved?
    0x42, // profile
    0xc0, // compat
    0x1f, // level
    0xff, // "info.nalu" (per Node Media Server source)
    0xe1, // "info.nb_sps" (per Node Media Server source)
    0x00, 0x19, // "nal size"
    // Above here are the bits exclusively seen by Node Media Server (specific to RTMP?)
    // Below here are the bits passed to HLS.js as "unit.data" (common to all AVC1 streams?):
    0x67, // "nal type"
    0x42, // profile (again?)
    0xc0, // compat (again?)
    0x1f, // level (again?)
    // Below here, data is not necessarily byte-aligned as Exponential Golomb encoding is used
    // ...
]
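
The Exponential Golomb fields mentioned in the last comment can be read with a small bit reader. The following is a minimal sketch, not a full SPS parser: the `BitReader` class and its field names are illustrative, the field order assumes a Baseline-profile SPS (profile 0x42) whose pic_order_cnt_type is 2, and it assumes no emulation prevention bytes (`00 00 03`) occur in the bytes being read, which a real parser would have to strip first.

```javascript
// Minimal MSB-first bit reader with an unsigned Exp-Golomb (ue(v)) decoder.
class BitReader {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0; // current position, in bits
  }

  readBit() {
    const index = this.pos >> 3;
    if (index >= this.bytes.length) throw new RangeError("read past end of data");
    const bit = (this.bytes[index] >> (7 - (this.pos & 7))) & 1;
    this.pos += 1;
    return bit;
  }

  readBits(count) {
    let value = 0;
    for (let i = 0; i < count; i++) value = (value << 1) | this.readBit();
    return value;
  }

  // ue(v): count leading zero bits, skip the 1 bit, then read that many bits.
  readUE() {
    let zeros = 0;
    while (this.readBit() === 0) zeros++;
    return (1 << zeros) - 1 + this.readBits(zeros);
  }
}

// The five SPS bytes that follow the level byte (0x1f) in the packet above.
const reader = new BitReader([0xa6, 0x11, 0x02, 0x80, 0xbf]);
const fields = {
  seqParameterSetId: reader.readUE(),
  log2MaxFrameNumMinus4: reader.readUE(),
  picOrderCntType: reader.readUE(), // extra fields follow when this is 0 or 1 (not handled here)
  maxNumRefFrames: reader.readUE(),
  gapsInFrameNumAllowed: reader.readBit(),
  picWidthInMbsMinus1: reader.readUE(),
  picHeightInMapUnitsMinus1: reader.readUE(),
};
console.log(fields);
```

For these bytes the sketch reads picWidthInMbsMinus1 as 39, that is 40 macroblocks or 640 pixels wide before any cropping is applied.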

The problem I'm running into is that, when creating the moov atom (and the avcC atom specifically), I need to know both the sps and the pps bytes. From the HLS.js source it looks like the sps may just be this video config packet minus the first 13 bytes. However, how do I find the pps? Is the pps actually the last few bytes of this packet, and should I split it somewhere? Will it be delivered in another packet? If two video packets are to be expected, is there some way to differentiate them so I know which one is the sps and which one is the pps?

If I can figure out this last little bit, then I should be completely done writing the moov atom (after which I just need to figure out the proper format for the mdat atom and I should have working code).

Update: For the record, I just checked the fourth packet being delivered to see if it might contain pps data. After reconnecting to the stream ~20 times the fourth packet was consistently a video packet (RTMP cid = 5), but the size of the packet ranged from 16000 bytes to 21000 bytes. I suspect this is legitimate video data.

Second Update: I just checked what the offset was in the video config packet when I finished parsing the SPS, and I was on byte 23 (0x84). It's therefore likely that the PPS is in fact at the end of this byte array, but I'm not sure how many bytes are delimiters / headers (NAL type, NAL length, etc) and how many bytes are the actual PPS.

The format is documented in many places on the internet, including my previous answer here: https://stackoverflow.com/questions/24884827 — szatmary, Dec 11 '18 at 18:04
@szatmary Thank you for the good link. It definitely made the spec easier to understand. Although it's left me a bit confused with my own data. The video packet I have starts with `17 00 00 00 00 01`, which resembles the NALU start code (`00 00 00 01`) with 2 extra bytes prefixed. However if the following byte is a NALU type (`42 & 1f == 2`) then that would mean I'm looking at a video slice. When I do see what I'm to understand is my NALU type (`67 & 1f == 7 == SPS`), it isn't preceded by a start code. Instead it's preceded by a length. Furthermore the length is not what I'd expect. — stevendesu, Dec 11 '18 at 18:31
Those bytes are not part of the video stream. They are part of the frame header: the codec ID, packet type, and CTS. Please review the FLV specification. — szatmary, Dec 11 '18 at 18:40
@szatmary Thanks for the direction. Once I took a look at the FLV spec and realized I should be dealing with a single AVCDecoderConfigurationRecord, I scrapped both the HLS.js code and the Node Media Server code and implemented the AVCDecoderConfigurationRecord per the spec. I was able to process the entire packet easily and extract both the SPS and PPS records. — stevendesu, Dec 11 '18 at 19:33
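
The approach described in the comments can be sketched as follows, using the 46 bytes from the question: a 5-byte FLV video tag header (frame type and codec ID, AVC packet type, composition time), then an AVCDecoderConfigurationRecord holding the length-prefixed SPS and PPS. This is only an illustration under those assumptions, not the code the asker ended up with; `parseAvcSequenceHeader` and the field names of the returned object are hypothetical.

```javascript
// Sketch: split an RTMP/FLV AVC sequence header into its SPS and PPS NAL units.
// parseAvcSequenceHeader is a hypothetical helper, not part of any library.
function parseAvcSequenceHeader(packet) {
  const buf = Buffer.from(packet);

  // FLV video tag header (5 bytes).
  const frameType = buf[0] >> 4;               // 1 = keyframe
  const codecId = buf[0] & 0x0f;               // 7 = AVC
  const avcPacketType = buf[1];                // 0 = sequence header, 1 = NAL units
  const compositionTime = buf.readIntBE(2, 3); // signed 24-bit CTS
  if (codecId !== 7 || avcPacketType !== 0) {
    throw new Error("not an AVC sequence header");
  }

  // AVCDecoderConfigurationRecord starts right after the tag header.
  let offset = 5;
  const record = {
    frameType,
    compositionTime,
    configurationVersion: buf[offset],
    profile: buf[offset + 1],
    compat: buf[offset + 2],
    level: buf[offset + 3],
    nalLengthSize: (buf[offset + 4] & 0x03) + 1, // low 2 bits, plus one
    sps: [],
    pps: [],
  };

  const spsCount = buf[offset + 5] & 0x1f;       // low 5 bits
  offset += 6;
  for (let i = 0; i < spsCount; i++) {
    const length = buf.readUInt16BE(offset);
    offset += 2;
    record.sps.push(buf.subarray(offset, offset + length));
    offset += length;
  }

  const ppsCount = buf[offset];
  offset += 1;
  for (let i = 0; i < ppsCount; i++) {
    const length = buf.readUInt16BE(offset);
    offset += 2;
    record.pps.push(buf.subarray(offset, offset + length));
    offset += length;
  }
  return record;
}

const packet = [
  0x17, 0x00, 0x00, 0x00, 0x00, 0x01, 0x42, 0xc0, 0x1f, 0xff, 0xe1, 0x00,
  0x19, 0x67, 0x42, 0xc0, 0x1f, 0xa6, 0x11, 0x02, 0x80, 0xbf, 0xe5, 0x84,
  0x00, 0x00, 0x03, 0x00, 0x04, 0x00, 0x00, 0x03, 0x00, 0xc2, 0x3c, 0x60,
  0xc8, 0x46, 0x01, 0x00, 0x05, 0x68, 0xc8, 0x42, 0x32, 0xc8,
];
const record = parseAvcSequenceHeader(packet);
console.log(record.sps[0].length);          // 25
console.log(record.pps[0].toString("hex")); // 68c84232c8
```

Read this way, the `00 19` before `0x67` is the SPS length (25 bytes), and the trailing `01 00 05` is the PPS count followed by the PPS length, which leaves the last five bytes as the PPS.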
