5

I'm encoding a live stream with VP9 via libvpx and want to stream it to an HTML5 player. I've read the Matroska specification and the W3C WebM Byte Stream Format, and examined a couple of WebM files generated by the vpxenc tool from libvpx. Everything seems clear, but I could not find any strict rules or guidelines on how to pack the encoded video frames inside the media segments described in the W3C specification.

As far as I understand, I have to emit media segments that contain Clusters with block elements inside. Since each frame I get from the encoder has a single timestamp, I can use a SimpleBlock element per frame. But how should I organize the Clusters? To me it makes sense to emit a single Cluster for each frame, with a single SimpleBlock entry, to reduce buffering and lag. Is such an approach considered normal, or are there drawbacks that mean I should instead buffer for some time interval and then emit a Cluster containing multiple SimpleBlock elements covering the buffered period?
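For reference, here is a minimal sketch (C++; the helper name is just for illustration) of how I understand one encoded frame maps onto a SimpleBlock payload per the Matroska spec: track number as a VINT, a signed 16-bit timecode relative to the Cluster, a flags byte, then the raw frame:

    #include <cstdint>
    #include <vector>

    // Sketch of a SimpleBlock payload for one VP9 frame, no lacing.
    // The surrounding EBML element header (SimpleBlock ID 0xA3 plus a size
    // VINT) is written separately by the muxer.
    std::vector<uint8_t> MakeSimpleBlockPayload(uint64_t track_number,
                                                int16_t relative_timecode,
                                                bool is_keyframe,
                                                const uint8_t* frame,
                                                size_t frame_size) {
        std::vector<uint8_t> out;
        // Track number as an EBML VINT; one byte suffices for track numbers < 127.
        out.push_back(static_cast<uint8_t>(0x80 | track_number));
        // Timecode relative to the enclosing Cluster, signed 16-bit, big-endian.
        const uint16_t tc = static_cast<uint16_t>(relative_timecode);
        out.push_back(static_cast<uint8_t>(tc >> 8));
        out.push_back(static_cast<uint8_t>(tc & 0xFF));
        // Flags: bit 0x80 marks a keyframe, lacing bits left at zero (no lacing).
        out.push_back(is_keyframe ? 0x80 : 0x00);
        // The raw encoded frame follows immediately.
        out.insert(out.end(), frame, frame + frame_size);
        return out;
    }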

UPDATE

So I implemented the described approach (emitting Clusters with a single SimpleBlock entry each) and the video lags a lot, so presumably this is not the way to go.

Rudolfs Bundulis

2 Answers

4

So I finally managed to mux a working live stream.

It seems that the initial approach I described (having a single cluster with a single SimpleBlock) actually works as such, but it has several downsides:

  • Key frames SHOULD be placed at the beginning of Clusters (for detecting them from the encoder output, see the sketch after this list).
  • It breaks seeking if the live stream is stored to a local file with curl or other means. From my understanding, a Cluster should consist of a full GOP.
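As a minimal sketch (C++; assuming the usual libvpx encode loop, the function name is illustrative), the frames that should open a new Cluster can be detected from the packet flags libvpx reports:

    #include <vpx/vpx_encoder.h>

    // Sketch: a keyframe can be detected from the libvpx packet flags; that
    // frame is the natural point at which to start a new Cluster.
    bool IsKeyframePacket(const vpx_codec_cx_pkt_t* pkt) {
        return pkt->kind == VPX_CODEC_CX_FRAME_PKT &&
               (pkt->data.frame.flags & VPX_FRAME_IS_KEY) != 0;
    }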

One of my initial assumptions was that a Cluster cannot have an "unknown" size, but in practice it turned out that Chrome, VLC and ffplay are happy with it, so there is no need to buffer a full GOP to determine the size, and the Cluster can be emitted on the fly.
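For illustration, a Cluster header with an unknown size is just the Cluster ID followed by a VINT whose value bits are all set (a sketch; the helper name is hypothetical); the Cluster Timecode element and SimpleBlocks can then be appended as they become available:

    #include <cstdint>
    #include <vector>

    // Sketch: EBML Cluster header with an "unknown" size, so the element can
    // be emitted before its total length is known. 0xFF is a one-byte VINT
    // with all value bits set, which EBML defines as "unknown size".
    std::vector<uint8_t> MakeUnknownSizeClusterHeader() {
        return {0x1F, 0x43, 0xB6, 0x75,  // Cluster element ID
                0xFF};                   // unknown size
    }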

Another important aspect is that the timecodes in SimpleBlock elements are signed 16-bit integers, so you can only encode an offset of up to 32767 from the Cluster timecode. With the default timescale, where 1 tick is 1 ms, this means a Cluster cannot be longer than roughly 32 seconds. If the GOP size is huge, this criterion must also be taken into account when deciding whether to emit a new Cluster.
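Putting both constraints together, the decision of when to open a new Cluster could look roughly like this (a sketch assuming the default 1 ms timescale; names are illustrative):

    #include <cstdint>

    // Sketch: start a new Cluster on a keyframe, or when the offset from the
    // current Cluster's timecode would no longer fit into a signed 16-bit
    // SimpleBlock timecode (assuming 1 tick = 1 ms).
    bool ShouldStartNewCluster(uint64_t frame_timecode_ms,
                               uint64_t cluster_timecode_ms,
                               bool is_keyframe) {
        const uint64_t offset_ms = frame_timecode_ms - cluster_timecode_ms;
        return is_keyframe || offset_ms > 32767;  // INT16_MAX
    }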

Finally, here is a link to a live stream (The "Big Buck Bunny" trailer, but in a live format) that seems to work with all the players and is generated as per the description above.

Hope this information helps someone.

Rudolfs Bundulis
  • I have been trying to stream WebM video recorded via MediaRecorder and stream it to the other end with a media stream. It works fine, but when someone joins from the middle of the stream it doesn't work. Do you have an example for this, please? – Tarun Rawat Jun 11 '20 at 18:46
  • @TarunRawat have you tried forcing keyframes when a new client joins? Are you using a closed GOP or an infinite one? – Rudolfs Bundulis Jun 12 '20 at 10:09
0

I think the answer depends on what type of latency you're looking for. Your described approach will work, but will introduce delay. This is typically not an issue for live-streaming, since the goal of live-streaming is not low-latency, but simply direct transmission. (In fact, in some cases, delay is wanted, even if it's still live streaming.)

If low-latency is a goal, you should look into things like RTP. It's not that it's not possible with the webm container, but it's just not a goal of the container, so you'll find that most tools implement webm in a way that doesn't care about low latency, since you wouldn't use it for that purpose; you'd use RTP instead.

(If instead, by lag, you mean that it stutters, please do mention it, because that suggests something different is going on.)

Ronald S. Bultje
  • Well, as stated, the goal is to stream to the native HTML5 video element, so RTP is not an option. And yeah, by lag I meant lag; it seems that even though I correctly mark the blocks that contain keyframes, the video is buffered for a very long time in VLC, ffplay and the browser (possibly they are expecting a cluster with multiple frames?). – Rudolfs Bundulis Sep 11 '15 at 11:12