
I'm trying to write a video exporter that combines N videos and places them side by side. The example below uses 3 videos. The videos are recorded WebRTC streams of speakers; each has a slightly different frame rate, and some were possibly recorded with a variable frame rate, since RTC video can drop its rate at times.

The problem is that when the videos are merged side by side with hstack, the audio drifts out of sync with its corresponding video (after some time the audio no longer follows the speaker's lips).

I also tried pre-exporting the individual videos to a constant frame rate with ffmpeg, but the desync still occurs. A similar audio desync happened when I used ffmpeg's hstack filter, so in the end I gave up on that route... but to my 'horror', the same desync also happens when combining with AVFoundation.

Any advice on how to keep the audio in sync with the combined video?

func hstackVideos() {
    let videoPaths: [String] = [
        "path/to/video1.mp4",
        "path/to/video2.mp4",
        "path/to/video3.mp4",
    ]

    let composition = AVMutableComposition()

    let assetInfos: [(AVURLAsset, AVAssetTrack, AVMutableCompositionTrack, AVAssetTrack, AVMutableCompositionTrack)] = videoPaths.map {
        let asset = AVURLAsset(url: URL(fileURLWithPath: $0))
        let track = composition.addMutableTrack(withMediaType: AVMediaType.video, preferredTrackID: kCMPersistentTrackID_Invalid)!
        let videoAssetTrack = asset.tracks(withMediaType: .video)[0]
        try! track.insertTimeRange(videoAssetTrack.timeRange, of: videoAssetTrack, at: CMTime.zero)
        let audioTrack = composition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)!
        let audioAssetTrack = asset.tracks(withMediaType: .audio)[0]
        try! audioTrack.insertTimeRange(audioAssetTrack.timeRange, of: audioAssetTrack, at: CMTime.zero)
        return (asset, videoAssetTrack, track, audioAssetTrack, audioTrack)
    }

    let stackComposition = AVMutableVideoComposition()

    stackComposition.renderSize = CGSize(width: 512, height: 288)
    // CMTime(value:timescale:) expresses 1/30 s exactly, with no floating-point rounding
    stackComposition.frameDuration = CMTime(value: 1, timescale: 30)
    // stackComposition.frameDuration = assetInfos[0].1.minFrameDuration

    var i = 0
    let instructions: [AVMutableVideoCompositionLayerInstruction] = assetInfos.map { (asset, assetTrack, compTrack, _, _) in
        let lInst = AVMutableVideoCompositionLayerInstruction(assetTrack: compTrack)
        let w: CGFloat = 512/CGFloat(assetInfos.count)
        let inRatio = assetTrack.naturalSize.width / assetTrack.naturalSize.height
        let cropRatio = w / 288
        let scale: CGFloat
        if inRatio < cropRatio {
            scale = w / assetTrack.naturalSize.width
        } else {
            scale = 288 / assetTrack.naturalSize.height
        }
        lInst.setCropRectangle(CGRect(x: w/scale, y: 0, width: w/scale, height: 288/scale), at: CMTime.zero)
        let transform = CGAffineTransform(scaleX: scale, y: scale)
        let t2 = transform.concatenating(CGAffineTransform(translationX: -w + CGFloat(i)*w, y: 0))
        lInst.setTransform(t2, at: CMTime.zero)
        i += 1
        return lInst
    }

    let inst = AVMutableVideoCompositionInstruction()
    inst.timeRange = CMTimeRange(start: CMTime.zero, duration: assetInfos[0].0.duration)
    inst.layerInstructions = instructions

    stackComposition.instructions = [inst]

    let exporter = AVAssetExportSession(asset: composition, presetName: AVAssetExportPresetHighestQuality)!
    let outPath = "path/to/finalVideo.mp4"
    let outUrl = URL(fileURLWithPath: outPath)
    try? FileManager.default.removeItem(at: outUrl)
    exporter.outputURL = outUrl
    exporter.videoComposition = stackComposition
    exporter.outputFileType = .mp4
    exporter.shouldOptimizeForNetworkUse = true

    let group = DispatchGroup()
    group.enter()
    exporter.exportAsynchronously(completionHandler: {
        // The completion handler only fires once the session reaches a terminal
        // state, so there is no `.exporting` case to handle here. (The original
        // `.exporting` branch also lacked a `group.leave()` and would have
        // deadlocked had it ever run.)
        switch exporter.status {
            case .completed:
                print("SUCCESS!")
            case .failed, .cancelled:
                print("Error: \(String(describing: exporter.error))")
                print("Description: \(exporter.description)")
            default:
                break
        }
        group.leave()
    })

    group.wait()
}

[Update 29/07/2021]

I've checked the input and output durations of the audio and video tracks. Here are the results (in seconds). Input videos:

  • video 1: (video track: 1086.586, audio track: 1086.483)
  • video 2: (video track: 1086.534, audio track: 1086.473)
  • video 3: (video track: 1086.5, audio track: 1086.483)

The output video had three audio tracks with significantly altered durations: (video track: 1086.5855, a1 track: 1079.208, a2 track: 1083.8826666666666, a3 track: 1086.5855).

I'm also noticing a small difference between the nominalFrameRate of the source and destination audio tracks (source rates: 46.786236, 46.561222, 46.762463; destination rates: 46.874996, 46.875, 46.875). This could explain the duration difference, although I don't know what a frame rate means for an audio track or why the exporter changes it.
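For reference, the duration and frame-rate figures above can be gathered with a small helper like the following (a sketch; the function name and path are mine, the accessors are the same synchronous AVFoundation APIs the code above already uses):

```swift
import AVFoundation

// Print the duration and nominal frame rate of every track in a file.
// Useful for comparing the inputs against the exported result.
func dumpTrackTimings(path: String) {
    let asset = AVURLAsset(url: URL(fileURLWithPath: path))
    for track in asset.tracks {
        let seconds = CMTimeGetSeconds(track.timeRange.duration)
        print("\(track.mediaType.rawValue) track:",
              "duration \(seconds) s,",
              "nominalFrameRate \(track.nominalFrameRate)")
    }
}
```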

I also tried using an AVMutableAudioMix, but the sync issue was still there.

It seems the input videos apply some kind of duration scaling to their audio tracks that gets lost when the tracks are placed in the composition. Any suggestion on how to preserve it?
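One thing worth trying (a sketch, not a verified fix; the helper name is mine): after inserting each audio track into the composition, explicitly rescale its edit so it spans exactly the same duration as its video track. `scaleTimeRange(_:toDuration:)` re-applies the kind of timing adjustment that may be getting lost:

```swift
import AVFoundation

// Stretch or compress an inserted audio edit so its composition duration
// exactly matches the video track it accompanies. Call this right after
// `audioTrack.insertTimeRange(...)` inside the map over assetInfos.
func alignAudio(_ audioTrack: AVMutableCompositionTrack,
                toVideoDuration videoDuration: CMTime) {
    let inserted = CMTimeRange(start: .zero,
                               duration: audioTrack.timeRange.duration)
    // Resamples the inserted range so it plays over `videoDuration`.
    audioTrack.scaleTimeRange(inserted, toDuration: videoDuration)
}
```

In the code above this would be called as `alignAudio(audioTrack, toVideoDuration: videoAssetTrack.timeRange.duration)` for each input.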

Matej Ukmar
  • could you provide a bunch of videos on which I can reproduce it? – Phil Dukhov Jul 28 '21 at 10:18
  • @Philip I updated my answer with new findings. I'll try to get you my test videos, I just need to ask everybody if they are ok with it. – Matej Ukmar Jul 29 '21 at 07:56
  • @Philip you can download my test videos here (two of them, not all three; the third presenter didn't want to share, but the audio sync issue is there with two as well): https://wetransfer.com/downloads/d8db53b6bd1a4340957789a12b97e0de20210729100845/219d5c2e70eecdbf672256eaeb604aa120210729100903/502921 – Matej Ukmar Jul 29 '21 at 10:13
  • I've checked it out; the problem is not the audio: I've compared the original vs generated audio tracks, and they are correct. It's the video, and the bug happens even when running your function on two copies of the same video. Haven't seen such a thing, not sure where to look from here. I think converting each video with `AVAssetExportSession` before passing it to the final func may help, but it'll increase processing time a lot – Phil Dukhov Jul 30 '21 at 03:02
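The pre-conversion proposed in the last comment could look roughly like this (a sketch; the function name, preset, and output type are assumptions):

```swift
import AVFoundation

// Hypothetical pre-pass: re-encode a single source clip on its own before
// stacking, so every input reaches the composition with freshly written,
// uniform timing. Slower overall, but isolates timing quirks per file.
func preconvert(_ inputURL: URL, to outputURL: URL,
                completion: @escaping (Error?) -> Void) {
    let asset = AVURLAsset(url: inputURL)
    guard let exporter = AVAssetExportSession(
        asset: asset,
        presetName: AVAssetExportPresetHighestQuality) else {
        completion(nil)
        return
    }
    exporter.outputURL = outputURL
    exporter.outputFileType = .mp4
    exporter.exportAsynchronously {
        completion(exporter.error)
    }
}
```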
