Media Foundation video re-encoding producing audio stream sync offset

Question

I'm attempting to write a simple Windows Media Foundation command-line tool that uses IMFSourceReader and IMFSinkWriter to load a video, read the video and audio as uncompressed streams, and re-encode them to H.264/AAC with some specific hard-coded settings.

The simple program Gist is here

sample video 1

sample video 2

sample video 3

(Note: the videos I've been testing with are all stereo, with a 48000 Hz sample rate)

The program works. However, in some cases, when I compare the new output video to the original in an editing program, the copied video streams match, but the audio stream of the copy is prefixed with some amount of silence, so the audio is offset. That is unacceptable in my situation.

audio samples:
original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[silence] [silence] [silence] [audio1] [audio2] [audio3] ... etc

In cases like this, the first incoming video frame has a non-zero timestamp, while the first audio sample has a timestamp of 0.

I would like to produce a copy whose first video frame and first audio sample both start at 0. I first attempted to subtract that initial timestamp (videoOffset) from all subsequent video frames, which produced the video I wanted but resulted in this situation with the audio:

original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[audio4] [audio5] [audio6] [audio7] [audio8] ... etc

The audio track is shifted now in the other direction by a small amount and still doesn't align. This can also happen sometimes when a video stream does have a starting timestamp of 0 yet WMF still cuts off some audio samples at the beginning anyway (see sample video 3)!
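The subtraction described above can be sketched as follows. This is only an illustration of the rebasing step, not the code from the Gist; the class name `TimestampRebaser` is hypothetical. In a real pipeline the returned value would be what is passed to `IMFSample::SetSampleTime` before the sample is handed to the sink writer.

```cpp
#include <cstdint>
#include <optional>

// Shifts a stream's timestamps (100-ns units) so that its first sample lands at 0.
// Hypothetical helper for illustration; one instance per stream.
class TimestampRebaser {
public:
    std::int64_t Rebase(std::int64_t timestampHns)
    {
        if (!first_) first_ = timestampHns; // remember the first timestamp seen
        return timestampHns - *first_;      // every later sample is shifted by the same amount
    }

private:
    std::optional<std::int64_t> first_;
};
```

Rebasing only the video stream changes the relative position of audio and video, which is consistent with the audio appearing shifted the other way.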

I've been able to fix this sync alignment and offset the video stream to start at 0 with the following code inserted at the point of passing the audio sample data to the IMFSinkWriter:

// inside the read-sample while loop
...

// LONGLONG llDuration holds the duration of the sample just read
// DWORD audioOffset holds the global audio offset in 100-ns units; starts at 0
// LONGLONG audioFrameTimeStamp holds the timestamp of the sample just read

// prepend silence in multiples of 1024 samples, once, before the first audio sample
static bool runOnce{ false };
if (!runOnce)
{
    size_t numberOfSilenceBlocks = 1; // how do I derive how many I need!? It's arbitrary
    size_t samples = 1024 * numberOfSilenceBlocks;
    // silence duration in 100-ns units; widened to 64 bits so the product cannot overflow
    audioOffset = static_cast<DWORD>(static_cast<ULONGLONG>(samples) * 10000000 / audioSamplesPerSecond);
    std::vector<uint8_t> silence(samples * audioChannels * bytesPerSample, 0);
    // the silence block takes the first sample's timestamp; audioOffset doubles as its duration
    WriteAudioBuffer(silence.data(), silence.size(), audioFrameTimeStamp, audioOffset);

    runOnce = true;
}

// every real sample is shifted later by the length of the inserted silence
LONGLONG audioTime = audioFrameTimeStamp + audioOffset;
WriteAudioBuffer(dataPtr, dataSize, audioTime, llDuration);

Oddly, this creates an output video file that matches the original.

original - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc
copy     - |[audio1] [audio2] [audio3] [audio4] [audio5] ... etc

The solution was to insert extra silence in block sizes of 1024 at the beginning of the audio stream. It doesn't matter what the audio chunk sizes provided by IMFSourceReader are, the padding is in multiples of 1024.
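To make the size of one padding block concrete, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and 16-bit stereo PCM is an assumption in the usage below. At 48000 Hz, 1024 samples come to 213333 hns, which is the spacing between the audio timestamps quoted in the comments (0, 213333, 426666). 1024 samples is also the frame length of AAC-LC, which may be why the padding comes in that unit.

```cpp
#include <cstddef>
#include <cstdint>

// Media Foundation timestamps and durations are in 100-nanosecond units (hns).
constexpr std::int64_t kHnsPerSecond = 10'000'000;

// Duration of `sampleCount` audio frames (one sample per channel) in hns.
// Truncates toward zero, like the integer arithmetic in the snippet above.
constexpr std::int64_t SamplesToHns(std::int64_t sampleCount, std::int64_t sampleRate)
{
    return sampleCount * kHnsPerSecond / sampleRate;
}

// Size in bytes of an interleaved PCM buffer covering `sampleCount` frames.
constexpr std::size_t SilenceBytes(std::size_t sampleCount, std::size_t channels,
                                   std::size_t bytesPerSample)
{
    return sampleCount * channels * bytesPerSample;
}
```

For example, `SamplesToHns(1024, 48000)` yields 213333, and `SilenceBytes(1024, 2, 2)` yields 4096 bytes for one block of 16-bit stereo silence.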

A screen shot of the audio track offsets of the different attempts for sample video 2

My problem is that there seems to be no detectable reason for the silence offset. Why do I need it? How do I know how much I need? I stumbled across the 1024-sample silence block solution after days of fighting this problem.

Some videos seem to only need 1 padding block, some need 2 or more, and some need no extra padding at all!

My questions here are:

Does anyone know why this is happening?
Am I using Media Foundation incorrectly in this situation to cause this?
If I am using it correctly, how can I use the video metadata to determine whether I need to pad an audio stream, and how many 1024-sample blocks of silence the pad needs?

EDIT:

For the sample videos above:

sample video 1: the video stream starts at 0 and needs no extra blocks; passthrough of the original data works fine.
sample video 2: the video stream starts at 834166 (hns) and needs one 1024-sample block of silence to sync.
sample video 3: the video stream starts at 0 and needs two 1024-sample blocks of silence to sync.

UPDATE:

Other things I have tried:

Increasing the duration of the first video frame to account for the offset: Produces no effect.

what are timestamps of first couple of audio and video samples, both in source and resulting files? — Andriy Tylychko, Mar 07 '19 at 23:57
@AndriyTylychko in this case the first video timestamp is `834166` followed by `1251249` then `1668332`. The audio timestamps start at `0` followed by `213333` and `426666` — m1keall1son, Mar 08 '19 at 15:03
^^ for the alpaca test video (needs 1 1024 block of silence to sync) — m1keall1son, Mar 08 '19 at 17:49
If the offset problem is file specific, maybe it's about poor/missing/wrong support of edit list atoms in MP4 files in either Media Foundation or software you are using to compare the output in. That is, maybe you should look into edts/elst atoms in the original files and check if they correlate with offsets you need to add. — Roman R., Mar 08 '19 at 18:40
@RomanR. you may be on to something with that. I'm not familiar with mp4 atoms, but it does mention some limited support in WMF for edts/elst in writing mp4 files. This link shows atom dumps from the original file and the straight copied file (no manipulation). https://gist.github.com/m1keall1son/02e6d437951fb35a488e2b0ce3744ca1 Can you expand on what I could look for in these atoms to find a problem with the copy? — m1keall1son, Mar 08 '19 at 20:18
Looks like Windows Media Foundation does not add edts/elst entries when creating mp4, which, if I understand correctly what they do, would cause sync issues when it's reloaded in another player or editor, right? — m1keall1son, Mar 08 '19 at 21:17
I would suggest that you also attach a link to one of the files after your processing. For example, that "sample video 3" without those two silence blocks, so that by looking at the files it's possibly clear where the sync got lost. — Roman R., Mar 08 '19 at 21:49
What looks probable to me though is that Media Foundation overall is okay and does not lose sync. The problem is that it reads an `elst`-enabled file and outputs data without `elst`. Your other application might not be aware of `elst` at all, and as a result it loses sync, but it actually loses sync on the first file and not on the second one. I'd look into that first. — Roman R., Mar 09 '19 at 11:48

mofo77 · Accepted Answer · 2019-03-12T23:32:17.200

I wrote another version of your program to handle NV12 format correctly (yours was not working) :

EncodeWithSourceReaderSinkWriter

I use Blender as my video editing tool. Here are my results with Tuning_against_a_window.mov :

from the bottom to the top :

Original file
Encoded file
The original file, changed by setting the number of entries in the "elst" atoms to 0 (I used the Visual Studio hex editor)

Like Roman R. said, MediaFoundation mp4 source doesn't use the "edts/elst" atoms. But Blender and your video editing tools do. Also the "tmcd" track is ignored by mp4 source.

"edts/elst" :

Edits Atom ( 'edts' )

Edit lists can be used for hint tracks...

MPEG-4 File Source

The MPEG-4 file source silently ignores hint tracks.

So in fact, the encoding is good. I think there is no audio stream sync offset compared to the real audio/video data. For example, you can add "edts/elst" to the encoded file to get the same result.

PS: on the encoded file, I added "edts/elst" for both the audio and video tracks. I also increased the sizes of the trak atoms and the moov atom. I confirm that Blender shows the same waveform for both the original and the encoded file.

EDIT

I tried to understand the relation between the mvhd/tkhd/mdhd/elst atoms in the 3 video samples. (Yes, I know, I should read the spec. But I'm lazy...)

You can use an mp4 explorer tool to get the atom values, or use the mp4 parser from my H264Dxva2Decoder project :

H264Dxva2Decoder
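For readers who want to read the edit list themselves, here is a minimal sketch of decoding an 'elst' payload. It is not taken from H264Dxva2Decoder; the names `ElstEntry` and `ParseElstV0` are hypothetical, and it assumes the caller has already located the atom and stripped its 8-byte size/type header. Only version 0 (32-bit fields) is handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// One version-0 edit list entry (stored big-endian in the file).
struct ElstEntry {
    std::uint32_t segmentDuration; // in movie (mvhd) timescale units
    std::int32_t  mediaTime;       // in media (mdhd) timescale units; -1 marks an empty edit
    std::uint32_t mediaRate;       // 16.16 fixed point, normally 0x00010000
};

static std::uint32_t ReadBE32(const std::uint8_t* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Parses the bytes that follow the 8-byte size/type header of an 'elst' atom.
// Returns std::nullopt for truncated data or for version 1, which this sketch skips.
std::optional<std::vector<ElstEntry>> ParseElstV0(const std::uint8_t* data, std::size_t size)
{
    if (size < 8 || data[0] != 0) return std::nullopt;        // need version/flags + count
    const std::uint32_t count = ReadBE32(data + 4);
    if ((size - 8) / 12 < count) return std::nullopt;         // 12 bytes per entry
    std::vector<ElstEntry> entries;
    entries.reserve(count);
    for (std::uint32_t i = 0; i < count; ++i) {
        const std::uint8_t* e = data + 8 + std::size_t(i) * 12;
        entries.push_back({ReadBE32(e),
                           static_cast<std::int32_t>(ReadBE32(e + 4)),
                           ReadBE32(e + 8)});
    }
    return entries;
}
```

The `mediaTime` field of the first entry is the "elst (media time)" value listed for each file.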

Tuning_against_a_window.mov

elst (media time) from tkhd video : 20689
elst (media time) from tkhd audio : 1483

GREEN_SCREEN_ANIMALS__ALPACA.mp4

elst (media time) from tkhd video : 2002
elst (media time) from tkhd audio : 1024

GOPR6239_1.mov

elst (media time) from tkhd video : 0
elst (media time) from tkhd audio : 0

As you can see, with GOPR6239_1.mov, media time from elst is 0. That's why there is no video/audio sync problem with this file.
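The elst media times are expressed in each track's own timescale, so they have to be converted before they can be compared with Media Foundation timestamps. The sketch below assumes mdhd timescales of 24000 for the video track and 48000 for the audio track of GREEN_SCREEN_ANIMALS__ALPACA.mp4; those values are assumptions and should be read from the file's mdhd atoms. Under that assumption, 2002 ticks convert to 834166 hns, which is the first video timestamp reported for that file in the comments, and 1024 ticks convert to 213333 hns, the length of one 1024-sample block.

```cpp
#include <cstdint>

// Media Foundation works in 100-nanosecond units (hns).
constexpr std::int64_t kHnsPerSecond = 10'000'000;

// Converts a time in `timescale` ticks per second to hns, truncating toward zero.
// `timescale` is expected to come from the track's mdhd atom.
constexpr std::int64_t TicksToHns(std::int64_t ticks, std::int64_t timescale)
{
    return ticks * kHnsPerSecond / timescale;
}
```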

For Tuning_against_a_window.mov and GREEN_SCREEN_ANIMALS__ALPACA.mp4, I tried to calculate the video/audio offset. I modified my project to take this into account :

EncodeWithSourceReaderSinkWriter

For now, I haven't found a generic calculation that works for all files.

I only found the video/audio offsets needed to encode these two files correctly.

For Tuning_against_a_window.mov, I begin encoding after (movie time - video/audio mdhd time). For GREEN_SCREEN_ANIMALS__ALPACA.mp4, I begin encoding after the video/audio elst media time.

It works, but I still need to find a single calculation that is right for all files.

So you have 2 options :

encode the file and add the elst atom
encode the file using the right offset calculation

It depends on your needs :

The first option lets you keep all of the original data, but you have to add the elst atom.
With the second option, you have to read the atoms from the file before encoding, and the encoded file will lose a few of the original frames.

If you choose the first option, I will explain how I add the elst atom.

PS : I'm interested in this question because, in my H264Dxva2Decoder project, the edts/elst atom is on my todo list. I parse it, but I don't use it...
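As a rough idea of what the first option involves, the sketch below serializes an 'edts' atom containing a single version-0 'elst' entry. It is not the procedure used for the files above, which were edited by hand; the function name `BuildEdtsAtom` is hypothetical, and the values to pass in are still the open question in this thread.

```cpp
#include <cstdint>
#include <vector>

// Appends a 32-bit value in big-endian order, as MP4 atoms require.
static void AppendBE32(std::vector<std::uint8_t>& out, std::uint32_t v)
{
    out.push_back(static_cast<std::uint8_t>(v >> 24));
    out.push_back(static_cast<std::uint8_t>(v >> 16));
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v));
}

// Builds a 36-byte 'edts' atom holding one version-0 'elst' entry at normal rate.
// segmentDuration is in movie (mvhd) timescale units, mediaTime in media (mdhd) units.
std::vector<std::uint8_t> BuildEdtsAtom(std::uint32_t segmentDuration, std::int32_t mediaTime)
{
    std::vector<std::uint8_t> out;
    AppendBE32(out, 36);          // edts size: 8-byte header + 28-byte elst
    AppendBE32(out, 0x65647473);  // 'edts'
    AppendBE32(out, 28);          // elst size: 8 header + 4 version/flags + 4 count + 12 entry
    AppendBE32(out, 0x656C7374);  // 'elst'
    AppendBE32(out, 0);           // version 0, flags 0
    AppendBE32(out, 1);           // entry count
    AppendBE32(out, segmentDuration);
    AppendBE32(out, static_cast<std::uint32_t>(mediaTime));
    AppendBE32(out, 0x00010000);  // media rate 1.0 as 16.16 fixed point
    return out;
}
```

The atom belongs inside the track's trak atom, and the sizes of that trak atom and of the enclosing moov atom must each grow by 36 bytes. If moov is stored before the media data, the chunk offsets would also need adjusting.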

PS2 : this link sounds interesting : Audio Priming - Handling Encoder Delay in AAC

Thank you for your detailed examination of this problem. This does look like what is happening. I'm trying to recreate your solution but am stuck on what to put in the edts/elst atoms on the copied video. When you say in your PS that you added the atoms back for both audio/video tracks, can you expand on what exactly the values were that you added back into the elst fields on those tracks? Thanks! — m1keall1son, Mar 12 '19 at 15:29
