How to synchronize audio and video using ffmpeg libraries?

Question

I am stuck writing a very basic media player in C using the SDL and ffmpeg libraries. Initially, I followed the theory in this page to get an idea of the overall program and how the libraries are used. After coding from scratch, with help from that tutorial and many other resources, I finally got my code working with the latest ffmpeg and SDL (2.0) libraries. But my code lacks a proper synchronization mechanism (actually, it has no sync mechanism at all).

I still don't have a clear idea of how to synchronize the audio and video, as the theory provided in the link is only partially correct (at least when it comes to the latest dev libraries).
For example, a sentence in this page is as follows:

However, ffmpeg reorders the packets so that the DTS of the packet being processed by avcodec_decode_video() will always be the same as the PTS of the frame it returns.

I am using avcodec_decode_video2(), and the DTS of the packet is definitely not the same as the PTS of the frame it decodes (in general).

I read this very informative BBC report and it makes complete sense. I have a clear idea about PTS and DTS in theory, but the PTS and DTS values that ffmpeg uses for packets and decoded frames are confusing. I wish there were some documentation on that aspect.

Can someone explain the steps to synchronize audio and video? I only need the steps; I am quite comfortable implementing them. Any help is greatly appreciated. Thanks!

PS: Here's a screenshot of what I am talking about:

[screenshot]

The huge negative value is, I assume, AV_NOPTS_VALUE.

score 3 · Answer 1 · edited May 23 '17 at 12:15

This is not a direct answer, but it contains a lot of useful information for the above problem. After reading more and coding a little, these are my observations, made with a .mpg file as input.

BBC RD 1996/3, in its very informative report, says:

To enable backward prediction from a future frame, the coder re-orders the pictures from natural display order to ‘transmission’ (or ‘bitstream’) order so that the B-picture is transmitted after the past and future pictures which it references. (See Fig. 14). This introduces a delay which depends upon the number of consecutive B-pictures.

The input file had its first few video frames as follows (in their natural display order):

I0 B0 B1 P0 B2 B3 P1 B4 B5 P2 B6 B7 P3 B8 B9 I1 ...

But the encoder (at the time the file was originally encoded) put the packets into the video stream in this order, so that every P and B frame can be decoded from frames that have already arrived:

I0 P0 B0 B1 P1 B2 B3 P2 B4 B5 P3 B6 B7 I1 B8 B9 ...

When av_read_frame() reads packets from the video stream, they are obtained in that same bitstream order:

I0 P0 B0 B1 P1 B2 B3 P2 B4 B5 P3 B6 B7 I1 B8 B9 ...

This is what avcodec_decode_video2() does (or at least is doing in this case):

```
Input I0 (pts_I0, dts_I0)     -----> DECODER ----> No Output Frame
Input P0 (pts_P0, dts_P0)     -----> DECODER ----> Output I0 (pts_I0, dts_P0)
Input B0 (pts_B0, dts_B0)     -----> DECODER ----> Output B0 (pts_B0, dts_B0)
Input B1 (pts_B1, dts_B1)     -----> DECODER ----> Output B1 (pts_B1, dts_B1)
Input P1 (pts_P1, dts_P1)     -----> DECODER ----> Output P0 (pts_P0, dts_P1)
Input B2 (pts_B2, dts_B2)     -----> DECODER ----> Output B2 (pts_B2, dts_B2)
Input B3 (pts_B3, dts_B3)     -----> DECODER ----> Output B3 (pts_B3, dts_B3)
Input P2 (pts_P2, dts_P2)     -----> DECODER ----> Output P1 (pts_P1, dts_P2)
Input B4 (pts_B4, dts_B4)     -----> DECODER ----> Output B4 (pts_B4, dts_B4)
Input B5 (pts_B5, dts_B5)     -----> DECODER ----> Output B5 (pts_B5, dts_B5)
Input P3 (pts_P3, dts_P3)     -----> DECODER ----> Output P2 (pts_P2, dts_P3)
Input B6 (pts_B6, dts_B6)     -----> DECODER ----> Output B6 (pts_B6, dts_B6)
Input B7 (pts_B7, dts_B7)     -----> DECODER ----> Output B7 (pts_B7, dts_B7)
Input I1 (pts_I1, dts_I1)     -----> DECODER ----> Output P3 (pts_P3, dts_I1)
Input B8 (pts_B8, dts_B8)     -----> DECODER ----> Output B8 (pts_B8, dts_B8)
Input B9 (pts_B9, dts_B9)     -----> DECODER ----> Output B9 (pts_B9, dts_B9)
Next Input Packet             -----> DECODER ----> Next Output Frame
           (pts_PKT, dts_PKT)                             I1 (pts_I1, dts_PKT)
```
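The pattern in the decoder trace can be reproduced with a tiny model. This is only a sketch of the behaviour observed for this one file, not FFmpeg code: `ReorderModel` and `reorder_feed` are hypothetical names, and the model assumes the decoder holds back exactly one reference picture (I or P) while passing B pictures straight through.

```c
#include <stdio.h>

#define NAME_LEN 8

/* Hypothetical model, not FFmpeg API: the single reference picture
 * (I or P) that has been fed in but not yet output. Empty string = none. */
typedef struct {
    char held[NAME_LEN];
} ReorderModel;

/* I and P pictures are reference pictures; B pictures are not. */
static int is_reference(const char *name)
{
    return name[0] == 'I' || name[0] == 'P';
}

/* Feed one picture in bitstream order.
 * Returns 1 and writes the picture that comes out (display order) to `out`,
 * or returns 0 if nothing is output for this input. */
int reorder_feed(ReorderModel *m, const char *in, char out[NAME_LEN])
{
    if (!is_reference(in)) {
        /* B picture: both of its references are already decoded,
         * so it is output immediately. */
        snprintf(out, NAME_LEN, "%s", in);
        return 1;
    }

    /* Reference picture: release the one held so far (if any)
     * and hold the new one until the next reference arrives. */
    int have_output = m->held[0] != '\0';
    if (have_output)
        snprintf(out, NAME_LEN, "%s", m->held);
    snprintf(m->held, NAME_LEN, "%s", in);
    return have_output;
}
```

Starting from `ReorderModel m = { "" };` and feeding I0, P0, B0, B1, P1 in that order, the model outputs nothing for I0 and then I0, B0, B1, P0, which is the display order from the trace.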

You can now see that, at every step, the decoder already holds the other frames (past or future frames in natural display order) that it needs to decode the input packet, and that it outputs frames in natural display order. Also, as far as I observed, the pts of access units (packets) containing I or P frames is usually AV_NOPTS_VALUE.
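Since some frames arrive with no usable pts, a player needs a rule for picking a timestamp before it can compare video against audio. The sketch below shows one such rule under stated assumptions: `NOPTS_VALUE` is a local stand-in for AV_NOPTS_VALUE (which is the most negative 64-bit integer, hence the huge negative value in debug output), `TimeBase` only mirrors the shape of AVRational, and `pick_timestamp` and `ticks_to_seconds` are hypothetical helpers. Falling back to the dts matches the trace for this file, where each held-back I or P frame comes out alongside the dts of the packet that released it. In real code, the `best_effort_timestamp` field that FFmpeg fills in on the decoded AVFrame is probably the better starting point.

```c
#include <stdint.h>

/* Stand-in for FFmpeg's AV_NOPTS_VALUE (assumption: same value, INT64_MIN). */
#define NOPTS_VALUE INT64_MIN

/* Mirrors the shape of AVRational; one tick lasts num/den seconds. */
typedef struct {
    int num;
    int den;
} TimeBase;

/* Prefer the frame's own pts; if it is missing, fall back to the dts
 * reported with the frame. The result may still be NOPTS_VALUE. */
int64_t pick_timestamp(int64_t frame_pts, int64_t pkt_dts)
{
    if (frame_pts != NOPTS_VALUE)
        return frame_pts;
    return pkt_dts;
}

/* Convert a timestamp in stream ticks to seconds.
 * Returns 1 on success, 0 if the timestamp or time base is unusable. */
int ticks_to_seconds(int64_t ts, TimeBase tb, double *seconds)
{
    if (ts == NOPTS_VALUE || tb.den == 0)
        return 0;
    *seconds = (double)ts * tb.num / tb.den;
    return 1;
}
```

For example, with a time base of 1/90000 (assumed here because MPEG program streams use a 90 kHz clock; read the real value from the stream), a timestamp of 45000 ticks converts to 0.5 seconds. That value in seconds is what gets compared against the audio clock when deciding how long to wait before showing the frame.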

PS: I do not know ASCII art! Sorry if the illustration is not very good. I hope it helps others.
Knowing this should make pts and dts much easier to understand.
This link and this link are other resources that I found useful.
