How can HTML5 video's byte-range requests (pseudo-streaming) work?

Question

If you play an HTML5 video hosted on a server that accepts range requests, then when you seek ahead to a non-buffered part of the video you'll notice from the network traffic that the browser makes a byte-range request. I'm assuming that the browser computes the byte offset by knowing the total video size ahead of time and assuming a constant bitrate (if you click halfway along the progress bar, it requests the byte at the halfway point). But especially if the video is variable bitrate, it seems unlikely that the requested byte could really correspond to the time point the user clicked on, and the byte would likely fall in the middle of a frame.

How does the browser know what the beginning of the next frame is, once it's begun fetching at some arbitrary byte?

szatmary · Accepted Answer · 2013-08-12T05:05:57.877

21

I assume your video is in an MP4 container. The MP4 file format contains a hierarchical structure of 'boxes' (also called atoms). One of these is the Time-to-Sample (stts) box, which stores the duration of every frame in a compact, run-length-encoded form, so the frame at any given time can be found. From there you can find the 'chunk' that contains the frame using the Sample-to-Chunk (stsc) box. Finally, the Chunk Offset (stco) box gives you the byte offset of that chunk in the file.
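The stts, stsc, and stco lookup described above can be sketched roughly as follows. This is a minimal illustration, not how any particular browser implements it. It assumes the tables have already been parsed into plain Python lists, the function names are hypothetical, and stsc entries are simplified to two fields (the real box also carries a sample description index). It returns the offset of the chunk containing the frame; locating the exact frame inside the chunk would additionally need the sample sizes.

```python
from typing import List, Tuple


def sample_for_time(stts: List[Tuple[int, int]], target: int) -> int:
    """Return the 0-based sample index covering `target` (in timescale units).

    Each stts entry is (sample_count, sample_delta): a run of samples
    that all share the same duration. Deltas are assumed to be > 0.
    """
    sample = 0
    elapsed = 0
    for count, delta in stts:
        span = count * delta
        if target < elapsed + span:
            return sample + (target - elapsed) // delta
        sample += count
        elapsed += span
    return sample - 1  # target is past the end: clamp to the last sample


def chunk_for_sample(stsc: List[Tuple[int, int]], sample: int, chunk_count: int) -> int:
    """Return the 0-based chunk index holding `sample`.

    Each simplified stsc entry is (first_chunk, samples_per_chunk), where
    first_chunk is 1-based and the run lasts until the next entry begins.
    """
    first_sample = 0
    for i, (first_chunk, per_chunk) in enumerate(stsc):
        next_first = stsc[i + 1][0] if i + 1 < len(stsc) else chunk_count + 1
        run_samples = (next_first - first_chunk) * per_chunk
        if sample < first_sample + run_samples:
            return (first_chunk - 1) + (sample - first_sample) // per_chunk
        first_sample += run_samples
    raise ValueError("sample index beyond last chunk")


def byte_offset_for_time(stts, stsc, stco, target: int) -> int:
    """Map a time to the file offset of the chunk that contains it."""
    sample = sample_for_time(stts, target)
    chunk = chunk_for_sample(stsc, sample, len(stco))
    return stco[chunk]
```

For example, with made-up tables `stts = [(4, 1000), (2, 500)]`, `stsc = [(1, 2), (3, 1)]`, and `stco = [100, 900, 1700, 2100]`, a target time of 2500 falls in the third sample, which lives in the second chunk.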

The total duration of the movie is stored in the Movie Header (mvhd) atom. When you release the scrub handle, the player estimates a time from the movie's duration and the handle's position, maps that time to a byte offset using the file header it downloaded previously, and issues the request.
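As a rough sketch of that last step, the scrub position can be turned into a time using the duration and timescale fields of mvhd, and the resulting byte offset placed in an HTTP `Range` header. The function names here are hypothetical, and the byte offset is assumed to come from a table lookup over the previously downloaded header.

```python
from typing import Dict, Optional


def seek_time_seconds(fraction: float, duration: int, timescale: int) -> float:
    """Convert a scrub-bar position (0.0 to 1.0) into seconds.

    `duration` and `timescale` are the mvhd fields: duration is expressed
    in timescale units, and timescale is the number of units per second.
    """
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range positions
    return fraction * duration / timescale


def range_header(start_byte: int, end_byte: Optional[int] = None) -> Dict[str, str]:
    """Build an HTTP Range header; an open-ended range reads to end of file."""
    if end_byte is None:
        return {"Range": f"bytes={start_byte}-"}
    return {"Range": f"bytes={start_byte}-{end_byte}"}
```

Releasing the handle halfway through a movie with `duration=120000` and `timescale=1000` gives a seek time of 60 seconds; the offset found for that time would then be passed to `range_header`.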

Edit: If it is not MP4, other containers have a similar mechanism. The codec is irrelevant.

edited Aug 12 '13 at 05:05

answered Aug 12 '13 at 03:03

szatmary


This answer is most convincing to me, but I'm not entirely convinced that this is indeed the way the browsers are doing it. I totally agree that it can be done this way though. To clarify, I found this blog helpful: http://thompsonng.blogspot.com/2010/11/mp4-file-format.html If what you're saying is correct, then seeking, even in a super large video with variable bitrate, should be totally accurate? Also, do you have any references that prove that browsers are using the moov atom (stts, stsc, stco data) for seeking? – bhh1988 Aug 12 '13 at 06:54

Personally I like to go directly to the source. That is ISO/IEC 14496-12, or the older "QuickTime File Format" documentation. Yes, I am sure this is how the browser does it for mp4, because it is the only way to seek in an mp4. – szatmary Aug 12 '13 at 15:39
You cannot walk the mdat, because audio and video chunks are interleaved with no indication of where one chunk stops and another begins. The only docs will be the browser source code. Note that each browser will likely use some sort of media abstraction. So the browser will simply call seekToTime() and the media layer will determine what HTTP requests need to be sent. – szatmary Aug 12 '13 at 15:47

Thomas W · Answer 2 · 2013-08-12T05:20:21.560

1

Many video/media types, such as MPEG, are encoded in fixed-size packets.

MPEG was originally designed on 188-byte packets (originally chosen to be 8 cells of the ATM transport layer, though that is now obsolete). So if you seek to a multiple of that 188-byte size, the player will read valid packets & recover sync when it finds the beginning of a frame.

An actual picture can be displayed once the browser/player reaches an I-frame (or keyframe), which can be decoded independently of any other frames. P- and B-frames are predicted from other frames, so if you seek to one of them you can't yet construct a picture.
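The resynchronization idea can be sketched for an MPEG transport stream as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a buffer of plain 188-byte packets, and the PID of the video stream is assumed to be known already. Each packet starts with the sync byte 0x47, and the payload_unit_start_indicator bit marks a packet where a new payload unit begins. That bit alone does not say whether the unit holds a keyframe, so a player would still have to inspect the payload.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def find_sync(data: bytes, checks: int = 3) -> int:
    """Return the first offset where the sync byte repeats every 188 bytes.

    Requiring `checks` consecutive hits avoids locking onto a stray 0x47
    that merely appears inside a payload. Returns -1 if no lock is found.
    """
    last = len(data) - TS_PACKET_SIZE * (checks - 1)
    for start in range(max(last, 0)):
        if all(data[start + i * TS_PACKET_SIZE] == SYNC_BYTE for i in range(checks)):
            return start
    return -1


def first_unit_start(data: bytes, pid: int) -> int:
    """Return the offset of the first packet on `pid` that starts a payload unit."""
    start = find_sync(data)
    if start < 0:
        return -1
    for off in range(start, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        if data[off] != SYNC_BYTE:
            return -1  # lost sync; a real player would search again
        packet_pid = ((data[off + 1] & 0x1F) << 8) | data[off + 2]
        unit_start = bool(data[off + 1] & 0x40)  # payload_unit_start_indicator
        if packet_pid == pid and unit_start:
            return off
    return -1
```

Note that this scans forward from wherever the read began; it finds a place to resume decoding, not the byte offset for a particular time.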


edited Aug 12 '13 at 05:20

answered Aug 12 '13 at 01:29

Thomas W


What about h.264? Also, it's not clear to me how this fits in with frames. What if the request fetched a packet from the middle of a frame? How does it know where the frame boundaries are so that it can properly continue playback? – bhh1988 Aug 12 '13 at 01:37
Frames are marked so that you *can* seek in a packet stream & recover sync. Player will typically read & skip until it finds a "complete picture" frame starting. If you wish to research packet-sizes, Google H.264 yourself.. – Thomas W Aug 12 '13 at 01:43
Not correct. Only HTTP Live Streaming uses Transport streams. With HLS, time positioning is determined by the manifest (m3u8) file, and entire segments are downloaded. No byte-ranges. – szatmary Aug 12 '13 at 05:08
The question is, _how to know the beginning of the next frame after fetching at an arbitrary byte._ My response answers this. Granted, it may be preferable for a browser to request a position in seconds & have the server resolve it.. But that's not what the question asks. – Thomas W Aug 12 '13 at 05:18
Why the downvotes? Technically correct, answers the question asked about byte offsets -> frames, and with links to the spec. – Thomas W Aug 12 '13 at 05:21
@ThomasW It's not clear to me from your references how these frames are "marked", so that you may know for sure that you've hit the border of a frame. Also, your references are from old standards, like MPEG-1. – bhh1988 Aug 12 '13 at 05:26
More recent MPEG are all built on the same core packetization, use the MPEG-1 video technology, and so these refs are up-to-date. We used to have an MPEG book at my previous work, which detailed actually how to read the packet & frame-types -- but granted, it's complex & the Wikipedia reference isn't super-clear or detailed enough to implement. – Thomas W Aug 12 '13 at 05:34
IIRC it was the "MPEG Handbook". This is a good reference for actually understanding packets, multiplexed streams, and finally getting up to find the frame marker. http://www.amazon.com/MPEG-Handbook-John-Watkinson/dp/024080578X/ref=sr_1_2?s=books&ie=UTF8&qid=1376285786&sr=1-2 – Thomas W Aug 12 '13 at 05:38
The video tag does not work with mpeg transport streams. I have written many mpeg TS parsers and generators from memory. I know the standard VERY well. There is no way to find the start of a frame without downloading each packet and looking for a payload unit start indicator. In the case of mp4, it is still done client side. The moov box contains all the information needed to randomly access any frame in the file. – szatmary Aug 12 '13 at 06:25
@szatmary I think Thomas is suggesting that what is actually happening is that packets are indeed downloaded by the client until the indicator of the start of a frame is found. I suppose this is theoretically one way that the seeking could work? – bhh1988 Aug 12 '13 at 06:50
Yes, as I understand the standard -- so long as one stays synced on the fundamental MPEG packet-size, player/client will be able to read & scan until picking up an I-frame & play from there. This was all designed to allow seeking, clients to join onto existing live streamcasts etc. Earlier MPEG files (most of them) don't have a "time -> offset" table at the start, players seek by packet offset. (Which could be supplemented with a binary search for desired timecode.) – Thomas W Aug 12 '13 at 08:44
Don't know how the HTML 5 browser support works, or about the STTS box -- MPEG files we worked with were all MPEG-2, and none of this extra spec existed. The standard's already designed to allow seek, join onto existing streams, and recovery from lost packets. – Thomas W Aug 12 '13 at 08:55
Actually, TS was designed so you can locate the start of a packet even over a serial connection. You just scan bit by bit until you receive 0100 0111 (the sync byte, 0x47) and try to parse that as a PAT. If the checksum passes, you are byte aligned and ready to decode. But you cannot jump to a point in time without doing a binary search. In MP4 the time index is at the beginning of the file. TS is a streaming format. MP4 is a random access format. That is why they call it pseudo-streaming when you play an mp4. – szatmary Aug 13 '13 at 00:15
