I'm working on an extension to Node Media Server to save incoming streams to disk as MP4. For this conversion to MP4 I'm leaning heavily on the Apple QuickTime Movie Specification, the ISO/IEC 14496-14 Specification (which I discovered in the Rust-MP4 GitHub repository for free), and the HLS.js source code.
I'm testing with a single video at the moment. Once this works I'll start experimenting with other videos. For my use case I only need to support H.264 video and AAC audio.
Currently when an RTMP connection is established, the first 3 packets I receive are consistently:
- 1 AMF metadata packet (RTMP cid = 6) containing information like video width, video height, bitrate, audio sample rate, etc.
- 1 audio packet (RTMP cid = 4) containing 7 bytes of data. I assume this is the AAC config packet.
- 1 video packet (RTMP cid = 5) containing 46 bytes of data. I assume this is the AVC config packet.
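A minimal sketch of how the "config packet" assumption could be checked, assuming the RTMP audio and video payloads begin with the same header bytes as FLV audio and video tags (codec ID in the first byte, packet type in the second). The function names are hypothetical and do not come from Node Media Server or HLS.js.

```javascript
// Hypothetical helpers; the names are illustrative only.
// Assumption: the payload starts with the FLV-style tag header bytes.

function isAvcSequenceHeader(payload) {
  if (payload.length < 2) return false;
  const codecId = payload[0] & 0x0f; // low nibble: 7 = AVC (H.264)
  const avcPacketType = payload[1];  // 0 = sequence header, 1 = NAL units
  return codecId === 7 && avcPacketType === 0;
}

function isAacSequenceHeader(payload) {
  if (payload.length < 2) return false;
  const soundFormat = payload[0] >> 4; // high nibble: 10 = AAC
  const aacPacketType = payload[1];    // 0 = sequence header, 1 = raw AAC frames
  return soundFormat === 10 && aacPacketType === 0;
}
```

Under that assumption, the 46-byte video packet (which starts with 0x17, 0x00) would be classified as a sequence header, while later video packets with a second byte of 0x01 would be ordinary frame data.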
When writing the MP4 moov atom, there are two places where I need to utilize additional information not located in the AMF metadata (and presumably located in these two config packets):
- In the esds atom, the HLS.js source appends "config" data. I assume I just append the entire 7-byte payload from the audio config packet here.
- In the avcC atom, the HLS.js source appends the "sps" and "pps" data. This is the root of my issue.
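For the audio side, here is a hedged sketch of one way to inspect the config packet before appending it. It assumes the audio payload is laid out like an FLV audio tag body (one byte of sound flags, one byte of AAC packet type, then the AudioSpecificConfig), in which case the first two bytes would be a header rather than part of the "config" data. The helper name and the sample bytes are illustrative; the bytes were not captured from the stream described here, and escape values (object type 31, frequency index 15) are not handled.

```javascript
// Hypothetical helper; assumes a 2-byte FLV-style audio header precedes
// the AudioSpecificConfig.
function parseAacConfigPacket(payload) {
  if (payload.length < 4) throw new RangeError("payload too short");
  const config = payload.slice(2);           // bytes after the 2-byte header
  const bits = (config[0] << 8) | config[1]; // first 16 bits of the config
  return {
    config,
    audioObjectType: bits >> 11,                // 5 bits (2 = AAC LC)
    samplingFrequencyIndex: (bits >> 7) & 0x0f, // 4 bits (4 = 44100 Hz)
    channelConfiguration: (bits >> 3) & 0x0f,   // 4 bits (2 = stereo)
  };
}

// Illustrative bytes only, not taken from the stream in this question.
const example = parseAacConfigPacket([0xaf, 0x00, 0x12, 0x10]);
console.log(example.audioObjectType, example.samplingFrequencyIndex, example.channelConfiguration);
```

If the decoded fields match the sample rate and channel count reported in the AMF metadata, that would suggest the header-plus-config layout holds for this stream.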
Regarding the parsing of these 46 bytes, I found code in Node Media Server and HLS.js that seems to parse the same data. The difference between these two pieces of code is that Node Media Server expects an additional 13 bytes of data at the start of the packet. The packet I receive seems to contain these additional 13 bytes, so I simply follow their lead in extracting width, height, profile, compat, and level information. The 46 bytes in particular are:
[0x17, 0x00, 0x00, 0x00, 0x00, 0x01, 0x42, 0xc0, 0x1f, 0xff, 0xe1, 0x00, 0x19, 0x67, 0x42, 0xc0, 0x1f, 0xa6, 0x11, 0x02, 0x80, 0xbf, 0xe5, 0x84, 0x00, 0x00, 0x03, 0x00, 0x04, 0x00, 0x00, 0x03, 0x00, 0xc2, 0x3c, 0x60, 0xc8, 0x46, 0x01, 0x00, 0x05, 0x68, 0xc8, 0x42, 0x32, 0xc8]
Breaking this down for the bytes I can easily parse (prior to the use of Exponential Golomb encoding):
[
  0x17, // "frame type", specifies H.264 or HEVC
  0x00, 0x00, 0x00, 0x00, 0x01, // ignored. Reserved?
  0x42, // profile
  0xc0, // compat
  0x1f, // level
  0xff, // "info.nalu" (per Node Media Server source)
  0xe1, // "info.nb_sps" (per Node Media Server source)
  0x00, 0x19, // "nal size"
  // Above here are the bits exclusively seen by Node Media Server (specific to RTMP?)
  // Below here are the bits passed to HLS.js as "unit.data" (common to all AVC1 streams?):
  0x67, // "nal type"
  0x42, // profile (again?)
  0xc0, // compat (again?)
  0x1f, // level (again?)
  // Below here, data is not necessarily byte-aligned as Exponential Golomb encoding is used
  // ...
]
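To illustrate the Exponential Golomb part, here is a minimal bit reader sketch that decodes unsigned Exp-Golomb values (ue(v)) MSB-first. The class name is hypothetical, and the field names in the usage lines reflect my reading of the H.264 SPS syntax for Baseline profile (profile 0x42), so treat them as assumptions. The sketch does not strip emulation-prevention bytes (the 0x03 in 0x00 0x00 0x03 sequences), which a complete SPS parser would need to do.

```javascript
// Minimal MSB-first bit reader with unsigned Exp-Golomb decoding.
// Sketch only: emulation-prevention bytes are NOT removed.
class BitReader {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0; // current position, in bits
  }

  readBit() {
    if (this.pos >= this.bytes.length * 8) throw new RangeError("out of data");
    const byte = this.bytes[this.pos >> 3];
    const bit = (byte >> (7 - (this.pos & 7))) & 1;
    this.pos += 1;
    return bit;
  }

  readBits(n) {
    let value = 0;
    for (let i = 0; i < n; i++) value = value * 2 + this.readBit();
    return value;
  }

  // ue(v): count leading zero bits, skip the 1 bit, then read that many suffix bits.
  readUE() {
    let zeros = 0;
    while (this.readBit() === 0) zeros += 1;
    return 2 ** zeros - 1 + this.readBits(zeros);
  }
}

// The two bytes that follow 0x67 0x42 0xc0 0x1f in the packet above.
const reader = new BitReader([0xa6, 0x11]);
const spsId = reader.readUE();                 // 0
const log2MaxFrameNumMinus4 = reader.readUE(); // 1
const picOrderCntType = reader.readUE();       // 2
```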
Now the problem I'm running into is that during the creation of the moov atom (and the avcC atom, specifically) I need to know both the sps and the pps bytes. From the HLS.js source it looks like the sps may just be this video config packet minus the first 13 bytes. However, how do I find the pps? Is the pps actually the last few bytes of this packet, and should I split it somewhere? Will it be delivered in another packet? If two video packets are to be expected, is there some way to differentiate them so I know which one is the sps and which one is the pps?
If I can figure out this last little bit, then I should be completely done writing the moov atom (after which point I just need to figure out the proper format for the mdat atom and I should have working code).
Update: For the record, I just checked the fourth packet being delivered to see if it might contain pps data. After reconnecting to the stream ~20 times the fourth packet was consistently a video packet (RTMP cid = 5), but the size of the packet ranged from 16000 bytes to 21000 bytes. I suspect this is legitimate video data.
Second Update: I just checked what the offset was in the video config packet when I finished parsing the SPS, and I was on byte 23 (0x84). It's therefore likely that the PPS is in fact at the end of this byte array, but I'm not sure how many bytes are delimiters / headers (NAL type, NAL length, etc) and how many bytes are the actual PPS.
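One way to test this guess is sketched below. It assumes that everything after the first 5 bytes of the packet follows the AVCDecoderConfigurationRecord layout from ISO/IEC 14496-15: version, profile, compat, level, a NAL length-size byte, an SPS count, length-prefixed SPS units, a PPS count, then length-prefixed PPS units. That layout is an assumption here, not something confirmed against the Node Media Server source, and the function name is hypothetical.

```javascript
// Hypothetical helper; assumes bytes 5+ form an AVCDecoderConfigurationRecord.
function splitAvcConfig(packet) {
  let offset = 5; // skip frame/codec byte, packet type, 3-byte composition time
  const configurationVersion = packet[offset++];
  const profile = packet[offset++];
  const compat = packet[offset++];
  const level = packet[offset++];
  const nalLengthSize = (packet[offset++] & 0x03) + 1; // low 2 bits, plus one

  // Reads `count` units, each prefixed with a 16-bit big-endian length.
  const readUnits = (count) => {
    const units = [];
    for (let i = 0; i < count; i++) {
      const size = (packet[offset] << 8) | packet[offset + 1];
      offset += 2;
      if (offset + size > packet.length) throw new RangeError("truncated NAL unit");
      units.push(packet.slice(offset, offset + size));
      offset += size;
    }
    return units;
  };

  const sps = readUnits(packet[offset++] & 0x1f); // low 5 bits = SPS count
  const pps = readUnits(packet[offset++]);        // whole byte = PPS count
  return { configurationVersion, profile, compat, level, nalLengthSize, sps, pps };
}

// The 46 bytes from the question.
const packet = [
  0x17, 0x00, 0x00, 0x00, 0x00, 0x01, 0x42, 0xc0, 0x1f, 0xff, 0xe1, 0x00, 0x19,
  0x67, 0x42, 0xc0, 0x1f, 0xa6, 0x11, 0x02, 0x80, 0xbf, 0xe5, 0x84, 0x00, 0x00,
  0x03, 0x00, 0x04, 0x00, 0x00, 0x03, 0x00, 0xc2, 0x3c, 0x60, 0xc8, 0x46, 0x01,
  0x00, 0x05, 0x68, 0xc8, 0x42, 0x32, 0xc8,
];
const result = splitAvcConfig(packet);
console.log(result.sps[0].length); // 25, matching the 0x00 0x19 "nal size"
console.log(result.pps[0]);        // [0x68, 0xc8, 0x42, 0x32, 0xc8], printed in decimal
```

Read this way, the bytes consume the packet exactly: 0x01 0x00 0x05 after the SPS would be a PPS count of 1 and a PPS length of 5, leaving the final 5 bytes as the PPS. The low 5 bits of 0x67 and 0x68 are 7 and 8, which are the NAL unit types for SPS and PPS, so the split is at least self-consistent.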