
I have been doing some research on encoding/decoding and streaming video between machines, and I feel as though I've gotten a fairly good grasp of the pipeline from file to stream. I can open the container, decode and grab individual frames and audio chunks, and I imagine moving those frames over the network is as easy as simply sending the byte data (albeit primitive and inefficient). What I don't understand is how it's actually played back. Simply writing the frames to some image box and dropping the audio data off at the sound card buffer doesn't really work very well. Can anyone explain what's actually happening in programs like VLC or Windows Media Player that allows them to send all of this frame data to the screen without destroying the CPU and memory? Just the general idea or some high-level documentation would be great. I don't even know where to get started...

Thank you!

Dabloons
  • VLC is open source, isn't it? You could check that :) – Simon Whitehead Jan 18 '13 at 02:34
  • I just remembered how we did it with an MPEG lib on the Pocket PC, and it was just like you described. We used a 2D lib, opened a surface, copied the frame bits into the surface and flipped that to the screen. Now of course you could buffer a couple of frames ahead, or do neat stuff like interpolating between frames. But I don't think at this point there is much more magic than just bringing the pixels to the screen. – dowhilefor Jan 18 '13 at 02:36
  • Draw the image and output the audio data, with audio-video time sync. – 9dan Jan 18 '13 at 03:31

2 Answers


If you use OpenGL, you can create a texture and constantly replace it with the new frame data. It's not a very expensive operation. You then draw a textured rectangle to your window. glOrtho is the useful projection here.
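
Something along these lines (only a sketch, assuming a legacy fixed-function OpenGL context; `get_next_frame()` stands in for whatever your decoder provides):

```cpp
// Rough sketch: assumes a legacy (fixed-function) OpenGL context already
// exists. get_next_frame() is a stand-in for your decoder and is expected
// to fill `pixels` with WIDTH x HEIGHT tightly packed RGB data.
#include <GL/gl.h>

const int WIDTH = 640, HEIGHT = 480;
GLuint tex;
unsigned char pixels[WIDTH * HEIGHT * 3];

void get_next_frame(unsigned char *dst);      // placeholder: your decoder

void init_texture()
{
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glPixelStorei(GL_UNPACK_ALIGNMENT, 1);    // rows are tightly packed
    // Allocate storage once; every later frame only replaces the pixels.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, WIDTH, HEIGHT, 0,
                 GL_RGB, GL_UNSIGNED_BYTE, NULL);
}

void draw_frame()
{
    get_next_frame(pixels);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT,
                    GL_RGB, GL_UNSIGNED_BYTE, pixels);   // cheap: no reallocation

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, 1, 0, 1, -1, 1);               // simple 1x1 "screen", no perspective
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glEnable(GL_TEXTURE_2D);
    glBegin(GL_QUADS);                        // one textured rectangle
    glTexCoord2f(0, 1); glVertex2f(0, 0);     // flipped vertically, since frame
    glTexCoord2f(1, 1); glVertex2f(1, 0);     // rows usually start at the top
    glTexCoord2f(1, 0); glVertex2f(1, 1);
    glTexCoord2f(0, 0); glVertex2f(0, 1);
    glEnd();
    // ...then swap buffers with whatever windowing toolkit you're using.
}
```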

In Windows, the same kind of thing would apply if you use DirectX or Direct3D. You can even get good performance blitting DIB Sections (GDI): Fastest method for blitting from a pixel buffer into a device context
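
The GDI version boils down to a single `StretchDIBits` call from your decoded pixel buffer inside the paint handler, roughly like this (a sketch assuming 32-bit top-down frames; the decoder fills `frame_pixels` elsewhere):

```cpp
// GDI sketch: blit a decoded 32-bit BGRA frame to the window during WM_PAINT.
// The decoder is assumed to fill frame_pixels elsewhere.
#include <windows.h>

const int WIDTH = 640, HEIGHT = 480;
unsigned char frame_pixels[WIDTH * HEIGHT * 4];   // filled by the decoder

void blit_frame(HWND hwnd)
{
    PAINTSTRUCT ps;
    HDC hdc = BeginPaint(hwnd, &ps);

    BITMAPINFO bmi = {};
    bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bmi.bmiHeader.biWidth       = WIDTH;
    bmi.bmiHeader.biHeight      = -HEIGHT;        // negative height = top-down rows
    bmi.bmiHeader.biPlanes      = 1;
    bmi.bmiHeader.biBitCount    = 32;
    bmi.bmiHeader.biCompression = BI_RGB;

    StretchDIBits(hdc,
                  0, 0, WIDTH, HEIGHT,            // destination rectangle
                  0, 0, WIDTH, HEIGHT,            // source rectangle
                  frame_pixels, &bmi,
                  DIB_RGB_COLORS, SRCCOPY);

    EndPaint(hwnd, &ps);
}
```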

Regardless of how you draw the pixels, you set a timer for updates, and it's that simple.

To get smooth operation, you need to buffer ahead of the drawing so that disk (or network) and decoding delays do not impact real-time drawing. Even the slightest jerkiness in video can be perceived by humans. By the time your timer fires you need to have the pixels decoded in an image buffer and ready to draw.
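
One simple way to structure that (just a sketch; `decode_next_frame()` and `draw()` stand in for your own decoder and drawing code) is a decode thread that keeps a small queue of ready frames topped up, while the timer-paced loop only ever pops an already-decoded frame:

```cpp
// Sketch of the buffering idea: a decoder thread keeps a small queue of
// already-decoded frames topped up, and the timer-paced playback loop just
// pops the next one and draws it.
#include <chrono>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using Frame = std::vector<unsigned char>;

std::deque<Frame> frame_queue;
std::mutex        queue_mutex;
const size_t      MAX_BUFFERED = 10;       // how far ahead to decode

Frame decode_next_frame();                 // placeholder: file/network read + decode
void  draw(const Frame &f);                // placeholder: OpenGL/GDI/Direct3D/...

void decoder_thread()
{
    for (;;) {
        Frame f = decode_next_frame();
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(queue_mutex);
                if (frame_queue.size() < MAX_BUFFERED) {
                    frame_queue.push_back(std::move(f));
                    break;
                }
            }
            // Queue is full, i.e. we're comfortably ahead: wait a little.
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
    }
}

void playback_loop(double fps)
{
    using clock = std::chrono::steady_clock;
    const std::chrono::nanoseconds frame_interval((long long)(1e9 / fps));
    const auto start = clock::now();

    for (long long n = 1; ; ++n) {
        Frame f;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (!frame_queue.empty()) {
                f = std::move(frame_queue.front());
                frame_queue.pop_front();
            }
        }
        if (!f.empty())
            draw(f);                       // already decoded, so this is cheap

        // Pace against the start time so small delays never accumulate.
        std::this_thread::sleep_until(start + n * frame_interval);
    }
}
```

You'd spawn `decoder_thread` on a `std::thread`, give it a moment to fill the queue, and then call `playback_loop(50.0)` (or whatever the clip's frame rate is) from the thread that owns the window.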

paddy
  • +1 but `bitblt`-ing in Windows is subject to an unfortunate side-effect called [tearing](http://stackoverflow.com/questions/2448831/how-to-eliminate-tearing-from-animation) which occurs whenever the image update gets caught in a refresh. You have to use Direct... or the newer APIs to avoid tearing, unfortunately. – MusiGenesis Jan 18 '13 at 02:50
  • Yeah I remember tearing from when I cut my teeth writing blit routines in assembly on a 386 =) I take it that the oldschool method of waiting for a VSync wouldn't apply... Thanks for your comment. – paddy Jan 18 '13 at 02:52
  • I ran into this exact problem not long ago while writing a player. I came up with a weird workaround that involved using DirectX just to time the v-sync events (which DirectX exposes). You just need to register a few events to get the timing of the v-syncs down since they happen with near-perfect regularity. In the animation engine you just make sure that you don't render any frames if you're within a couple of milliseconds of the sync on either side; instead you delay until just after. It works but it's remarkably stupid, since if you're going to use DirectX you might as well just use DirectX – MusiGenesis Jan 18 '13 at 03:16

I have written a number of player applications (for Windows) that combine video and audio and require precise synchronization between the two. In Windows audio, you basically prepare buffers (which are just arrays of audio sample values) and queue them up to the audio subsystem for playback; the subsystem makes a callback to your app as each buffer completes playback, and your app uses each callback to 1) render the next frame to the screen, and 2) prepare the next chunk of audio to be queued up to the audio subsystem.

For example, let's say you have some frames of video in memory that you want to play at 50 frames per second, in sync with audio that is mono, 2 bytes per sample and 44,100 samples per second. This means that your audio buffers each need to be 882 samples in size (44,100 / 50 = 882), so each buffer is just an array of short (2-byte) integers with 882 elements. You need at least two buffers, but in practice more is better (the tradeoff is that more buffers means smoother playback at the cost of a longer startup delay and a larger memory footprint).

The frames of the video need to be "buffered" in the same way so that at least one frame is always ready to be rendered; transferring a single image to a PC screen is so fast that it's effectively instantaneous and not something you need to worry about. The only concern is with whatever method extracts or composes the frames. These methods need to be at least fast enough to keep up with the playback rate, or the frames need to be buffered well in advance of playback, which again means a longer startup delay and a larger memory footprint (these problems are much worse for video than for audio at any reasonable resolution).

As the app begins playback, it pre-loads all of the buffers with audio and queues them up for playback; then, it simultaneously starts playback and renders the first frame to the screen. The user sees the first frame and hears the first 20 ms of audio (20 ms = 1/50 second). At this point the audio subsystem switches playback from the first buffer to the second buffer, and makes a callback to the application. The app then renders the second frame to the screen and fills the first buffer with the next available chunk of audio, then queues up this first buffer again to the audio subsystem.
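
Put together, a bare-bones version of that loop might look something like this using the old waveOut API (just one way of doing it; `fill_audio()` and `render_next_frame()` stand in for your own decoding and drawing code):

```cpp
// Bare-bones sketch of the scheme above using the legacy waveOut API (one
// possible choice). Numbers match the example: mono, 16-bit, 44,100 Hz,
// 50 fps, so 882 samples (20 ms) per buffer.
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

const int SAMPLE_RATE        = 44100;
const int FRAME_RATE         = 50;
const int SAMPLES_PER_BUFFER = SAMPLE_RATE / FRAME_RATE;   // 882
const int NUM_BUFFERS        = 2;   // more = smoother, but longer startup delay

short   audio[NUM_BUFFERS][SAMPLES_PER_BUFFER];
WAVEHDR headers[NUM_BUFFERS] = {};

bool fill_audio(short *dst, int samples);   // placeholder: next audio chunk, false at end
void render_next_frame();                   // placeholder: draw the matching video frame

void play()
{
    WAVEFORMATEX fmt = {};
    fmt.wFormatTag      = WAVE_FORMAT_PCM;
    fmt.nChannels       = 1;                // mono
    fmt.wBitsPerSample  = 16;               // 2 bytes per sample
    fmt.nSamplesPerSec  = SAMPLE_RATE;
    fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
    fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

    HANDLE done_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    HWAVEOUT hwo;
    waveOutOpen(&hwo, WAVE_MAPPER, &fmt, (DWORD_PTR)done_event, 0, CALLBACK_EVENT);

    // Pre-load every buffer and queue them all; playback starts on the first write.
    for (int i = 0; i < NUM_BUFFERS; ++i) {
        fill_audio(audio[i], SAMPLES_PER_BUFFER);
        headers[i].lpData         = (LPSTR)audio[i];
        headers[i].dwBufferLength = SAMPLES_PER_BUFFER * sizeof(short);
        waveOutPrepareHeader(hwo, &headers[i], sizeof(WAVEHDR));
        waveOutWrite(hwo, &headers[i], sizeof(WAVEHDR));
    }
    render_next_frame();   // first frame goes up as the first buffer starts playing

    // Every ~20 ms a buffer finishes: show the next frame, refill that
    // buffer with the next chunk of audio, and hand it back to the driver.
    for (bool more = true; more; ) {
        WaitForSingleObject(done_event, INFINITE);
        for (int i = 0; i < NUM_BUFFERS && more; ++i) {
            if (headers[i].dwFlags & WHDR_DONE) {
                render_next_frame();
                more = fill_audio(audio[i], SAMPLES_PER_BUFFER);
                if (more) {
                    headers[i].dwFlags &= ~WHDR_DONE;
                    waveOutWrite(hwo, &headers[i], sizeof(WAVEHDR));
                }
            }
        }
    }

    // A real player would wait for the last buffers and unprepare the headers here.
    waveOutReset(hwo);
    waveOutClose(hwo);
    CloseHandle(done_event);
}
```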

As long as the application has audio and video data available to keep filling buffers and frames, this process continues and you see/hear the video.

MusiGenesis