I'm trying to optimise a piece of software for playing video, which internally uses the FFmpeg libraries for decoding. We've found that on some large (4K, 60fps) video, it sometimes takes longer to decode a frame than that frame should be displayed for; sadly, because of the problem domain, simply buffering/skipping frames is not an option.
However, the FFmpeg executable appears to decode the video in question without trouble, at about 2x real-time speed, so I've been trying to work out what we're doing wrong.
I've written a very stripped-back decoder program for testing; the source is here (it's about 200 lines). Profiling it shows that the one major bottleneck during decoding is the avcodec_send_packet()
function, which can take up to 50ms per call. However, measuring the same call in the FFmpeg executable shows strange behaviour:
(These are the times taken for each call to avcodec_send_packet(), in milliseconds, when decoding a 4K 25fps VP9-encoded video.)
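For context, the test decoder's inner loop is roughly the following (a trimmed sketch rather than the full ~200-line source linked above, with error handling omitted; the timing around avcodec_send_packet() is where the 50ms calls show up):

```c
#include <stdio.h>
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
#include <libavutil/time.h>

/* Sketch of the decode loop; fmt_ctx, dec_ctx and video_stream_index
 * are assumed to have been set up already. */
static void decode_loop(AVFormatContext *fmt_ctx, AVCodecContext *dec_ctx,
                        int video_stream_index)
{
    AVPacket *pkt = av_packet_alloc();
    AVFrame *frame = av_frame_alloc();

    while (av_read_frame(fmt_ctx, pkt) >= 0) {
        if (pkt->stream_index == video_stream_index) {
            int64_t t0 = av_gettime_relative();
            int ret = avcodec_send_packet(dec_ctx, pkt);   /* the slow call */
            int64_t t1 = av_gettime_relative();
            printf("avcodec_send_packet: %.2f ms\n", (t1 - t0) / 1000.0);

            /* Drain whatever frames the decoder has ready. */
            while (ret >= 0) {
                ret = avcodec_receive_frame(dec_ctx, frame);
                if (ret < 0)
                    break;   /* AVERROR(EAGAIN): needs more input; or EOF/error */
                /* a decoded frame would be handed to the display path here */
                av_frame_unref(frame);
            }
        }
        av_packet_unref(pkt);
    }

    av_packet_free(&pkt);
    av_frame_free(&frame);
}
```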
Basically, it seems that when the FFmpeg executable calls this function, it only takes a significant amount of time once every N calls, where N is the number of threads being used for decoding. However, both my test decoder and the actual product use 4 threads for decoding, and this pattern doesn't appear: with frame-based threading enabled, the test decoder behaves like FFmpeg running with only 1 thread. That would seem to indicate that we're not using multithreading at all, yet we have still seen performance improvements from using more threads.
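For reference, the test decoder enables threading with something like the following before opening the codec (a trimmed sketch rather than the exact code; the thread_count/thread_type fields and the FF_THREAD_FRAME flag are the real libavcodec ones, the surrounding setup is illustrative):

```c
#include <libavcodec/avcodec.h>

/* Sketch of opening the decoder with frame-based threading enabled;
 * codecpar comes from the video stream found during demuxer setup. */
static AVCodecContext *open_decoder(const AVCodecParameters *codecpar)
{
    const AVCodec *codec = avcodec_find_decoder(codecpar->codec_id);
    AVCodecContext *dec_ctx = avcodec_alloc_context3(codec);

    avcodec_parameters_to_context(dec_ctx, codecpar);

    dec_ctx->thread_count = 4;                 /* same thread count as the product */
    dec_ctx->thread_type  = FF_THREAD_FRAME;   /* frame-based threading */

    if (avcodec_open2(dec_ctx, codec, NULL) < 0) {
        avcodec_free_context(&dec_ctx);
        return NULL;
    }
    return dec_ctx;
}
```

(thread_count = 0 would let libavcodec pick a thread count automatically; it's pinned to 4 here to match what the product uses.)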
FFmpeg's timings average out to roughly twice as fast overall as our decoders', so clearly we're doing something wrong. I've been reading through FFmpeg's source to try to find any clues, but the cause has so far eluded me.
My question is: what's FFmpeg doing here that we're not? Alternatively, how can we increase the performance of our decoder?
Any help is greatly appreciated.