I have written a number of player applications (for Windows) that combine video and audio and require precise synchronization between the two. In Windows audio, you basically prepare buffers (which are just arrays of audio sample values) and queue them up to the audio subsystem for playback; the subsystem makes a callback to your app as each buffer completes playback, and your app uses each callback to 1) render the next frame to the screen, and 2) prepare the next chunk of audio to be queued up to the audio subsystem.
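That queue-and-callback pattern maps onto, for example, the classic waveOut API (other Windows audio APIs such as WASAPI work in a similar fashion). Here is a minimal sketch of the callback side, assuming waveOut and with illustrative names:

```c
#include <windows.h>
#include <mmsystem.h>   /* waveOut API; link with winmm.lib */

/* The audio subsystem calls this as each queued buffer finishes playing.
   WOM_DONE carries a pointer to the WAVEHDR that just completed, which is
   the app's cue to render the next frame and refill/requeue that buffer. */
static void CALLBACK AudioCallback(HWAVEOUT hwo, UINT msg, DWORD_PTR user,
                                   DWORD_PTR param1, DWORD_PTR param2)
{
    if (msg == WOM_DONE) {
        WAVEHDR *finished = (WAVEHDR *)param1;
        /* ...render next frame, refill 'finished', queue it again... */
        (void)finished;
    }
    (void)hwo; (void)user; (void)param2;
}
```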
For example, let's say you have some frames of video in memory that you want to play at 50 frames per second, in sync with audio that is mono, 2 bytes per sample, and 44,100 samples per second. This means each audio buffer needs to be 882 samples long (44,100 / 50 = 882), so each buffer is just an array of 882 short (2-byte) integers. You need at least two buffers, but in practice more is better (the tradeoff is that more buffers mean smoother playback at the cost of a longer startup delay and a larger memory footprint).
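Sketching those numbers in code (again assuming waveOut; the choice of four buffers is just illustrative):

```c
#include <windows.h>
#include <mmsystem.h>   /* waveOut API; link with winmm.lib */

enum {
    SAMPLE_RATE        = 44100,
    FRAMES_PER_SECOND  = 50,
    SAMPLES_PER_BUFFER = SAMPLE_RATE / FRAMES_PER_SECOND,  /* 882 = 20 ms */
    NUM_BUFFERS        = 4   /* at least 2; more = smoother, but more delay/memory */
};

static short   g_samples[NUM_BUFFERS][SAMPLES_PER_BUFFER];  /* 2-byte samples */
static WAVEHDR g_headers[NUM_BUFFERS];

static BOOL OpenAudio(HWAVEOUT *hwo, DWORD_PTR callback)
{
    WAVEFORMATEX fmt = {0};
    fmt.wFormatTag      = WAVE_FORMAT_PCM;
    fmt.nChannels       = 1;            /* mono */
    fmt.wBitsPerSample  = 16;           /* 2 bytes per sample */
    fmt.nSamplesPerSec  = SAMPLE_RATE;
    fmt.nBlockAlign     = fmt.nChannels * fmt.wBitsPerSample / 8;
    fmt.nAvgBytesPerSec = fmt.nSamplesPerSec * fmt.nBlockAlign;

    if (waveOutOpen(hwo, WAVE_MAPPER, &fmt, callback, 0,
                    CALLBACK_FUNCTION) != MMSYSERR_NOERROR)
        return FALSE;

    for (int i = 0; i < NUM_BUFFERS; i++) {
        g_headers[i].lpData         = (LPSTR)g_samples[i];
        g_headers[i].dwBufferLength = SAMPLES_PER_BUFFER * sizeof(short);
        waveOutPrepareHeader(*hwo, &g_headers[i], sizeof(WAVEHDR));
    }
    return TRUE;
}
```

OpenAudio would be called with the callback from the earlier sketch, e.g. `OpenAudio(&hwo, (DWORD_PTR)AudioCallback)`.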
The frames of the video need to be "buffered" in the same way so that at least one frame is always ready to be rendered; transferring a single image to a PC screen is so fast that it's effectively instantaneous and not something you need to worry about. The only real concern is whatever method extracts or composes the frames: it needs to be at least fast enough to keep up with the playback rate, or the frames need to be buffered well in advance of playback, which again means a longer startup delay and a larger memory footprint (and these problems are much worse for video than for audio at any reasonable resolution).
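The video side can be sketched the same way, for instance as a small ring of pre-decoded frames that a decoder thread keeps topped up ahead of the render position; the frame size, queue depth, and DecodeNextFrame callback below are all illustrative assumptions:

```c
#include <windows.h>

#define FRAME_QUEUE_SIZE 8                /* illustrative: 8 frames = 160 ms at 50 fps */
#define FRAME_BYTES      (640 * 480 * 4)  /* illustrative 640x480, 32-bit pixels */

/* Ring buffer of decoded frames: the decoder fills slots at 'tail', the
   render step (driven by the audio callback) consumes them at 'head'. */
typedef struct {
    unsigned char frames[FRAME_QUEUE_SIZE][FRAME_BYTES];
    volatile LONG count;   /* frames currently ready to render */
    int           head;
    int           tail;
} FrameQueue;

/* Run on a decoder thread so at least one frame is always ready; the
   DecodeNextFrame callback stands in for whatever extracts or composes
   each image, and it must keep up with the playback rate on average. */
static void KeepFramesAhead(FrameQueue *q,
                            BOOL (*DecodeNextFrame)(unsigned char *dst))
{
    while (q->count < FRAME_QUEUE_SIZE && DecodeNextFrame(q->frames[q->tail])) {
        q->tail = (q->tail + 1) % FRAME_QUEUE_SIZE;
        InterlockedIncrement(&q->count);
    }
}
```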
To begin playback, the app pre-loads all of the buffers with audio and queues them up; then it starts playback and renders the first frame to the screen at the same moment. The user sees the first frame and hears the first 20 ms of audio (20 ms = 1/50 second). At that point the audio subsystem switches from the first buffer to the second and makes a callback to the application. The app then renders the second frame to the screen, fills the first buffer with the next available chunk of audio, and queues that first buffer up to the audio subsystem again.
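Putting that sequence together in one hedged sketch (same illustrative sizes as above; FillNextAudioChunk and RenderFrame are stand-ins for the application's own decode and render steps):

```c
#include <windows.h>
#include <mmsystem.h>   /* waveOut API; link with winmm.lib */
#include <string.h>

enum { SAMPLE_RATE = 44100, FRAMES_PER_SECOND = 50,
       SAMPLES_PER_BUFFER = SAMPLE_RATE / FRAMES_PER_SECOND,  /* 882 = 20 ms */
       NUM_BUFFERS = 4 };

static short   g_samples[NUM_BUFFERS][SAMPLES_PER_BUFFER];
static WAVEHDR g_headers[NUM_BUFFERS];
static int     g_frameIndex;

/* Stand-ins for the application's own audio decode and frame render steps. */
static void FillNextAudioChunk(short *dst, int n) { memset(dst, 0, n * sizeof(short)); }
static void RenderFrame(int index)                { (void)index; }

/* One WOM_DONE per 20 ms buffer: show the next frame, then refill and requeue
   the buffer that just finished.  (A production player usually just signals a
   playback thread here instead, since calling other waveOut functions from
   inside the callback is documented as a potential deadlock.) */
static void CALLBACK AudioCallback(HWAVEOUT hwo, UINT msg, DWORD_PTR user,
                                   DWORD_PTR p1, DWORD_PTR p2)
{
    (void)user; (void)p2;
    if (msg != WOM_DONE) return;
    WAVEHDR *done = (WAVEHDR *)p1;
    RenderFrame(++g_frameIndex);                          /* next video frame   */
    FillNextAudioChunk((short *)done->lpData, SAMPLES_PER_BUFFER);
    waveOutWrite(hwo, done, sizeof(WAVEHDR));             /* requeue the buffer */
}

/* Pre-load every buffer, queue them all, and render frame 0; audio playback
   starts as soon as the first buffer is written. */
static void StartPlayback(HWAVEOUT hwo)
{
    for (int i = 0; i < NUM_BUFFERS; i++) {
        g_headers[i].lpData         = (LPSTR)g_samples[i];
        g_headers[i].dwBufferLength = SAMPLES_PER_BUFFER * sizeof(short);
        waveOutPrepareHeader(hwo, &g_headers[i], sizeof(WAVEHDR));
        FillNextAudioChunk(g_samples[i], SAMPLES_PER_BUFFER);
        waveOutWrite(hwo, &g_headers[i], sizeof(WAVEHDR));
    }
    RenderFrame(0);
}
```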
As long as the application has audio and video data available to keep filling buffers and frames, this process repeats and you see/hear the video in sync.