The actual question:
Currently OpenCV is used to write the video frames into a single file. Can I append the audio directly as well, or is there some other way to create small video snippets that can then be broadcast via the RTP protocol, or to broadcast them directly from Python code?
out = cv2.VideoWriter(
    'temp/result.avi',
    cv2.VideoWriter_fourcc(*'DIVX'),
    fps,
    (frame_w, frame_h))

...  # some frame manipulation happening

out.write(f)  # f = video frame
I don't want to write the video file and afterwards combine it with audio using ffmpeg.
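To illustrate what I mean, this is roughly the direction I have in mind: pipe the raw frames into an ffmpeg subprocess that also reads the audio file, then encodes and muxes/streams both. This is only an untested sketch; the resolution, fps, paths, codecs and the rtp_mpegts target are placeholders on my side.

import subprocess

# Untested sketch: ffmpeg reads raw BGR frames from stdin, pulls the audio
# from the source file, encodes and streams both together.
fps, frame_w, frame_h = 30, 854, 480  # assumed values matching face.mp4

cmd = [
    'ffmpeg', '-y',
    '-f', 'rawvideo', '-pix_fmt', 'bgr24',
    '-s', f'{frame_w}x{frame_h}', '-r', str(fps), '-i', '-',  # video from stdin
    '-i', 'testdata/audio.mp4',                               # audio from file
    '-c:v', 'libx264', '-preset', 'ultrafast', '-tune', 'zerolatency',
    '-c:a', 'aac',
    '-f', 'rtp_mpegts', 'rtp://127.0.0.1:5004',               # placeholder target
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

# in the frame loop, instead of out.write(f):
#     proc.stdin.write(f.tobytes())
# and when done:
#     proc.stdin.close(); proc.wait()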
Background:
I'm trying to write an application that needs real-time lip syncing. For that purpose I'm experimenting with Wav2Lip. At first this library seems very slow, but it can actually be pretty fast with some optimizations.
Experiments:
I first lip-synced a video with another video file manually using the following command:
python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.mp4 --audio testdata\audio.mp4
The face.mp4 file has a duration of 25 seconds, 30 fps and a resolution of 854x480. The audio.mp4 file has a duration of 260 seconds, 30 fps and a resolution of 480x360.
The total generation took exactly 109 seconds. After dissecting the code and profiling it, I found that there are two parts which take the longest:
- the face detection part took 48.64 seconds
- the lip-syncing part took 48.50 seconds
I then tried it with a static image instead of a video, which cuts down the time significantly (in my use case I will later only use the same face, so I will probably pre-calculate the face detection at startup; see the sketch after the timings below).
python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.jpg --audio testdata\audio.mp4
- the face detection part took 1.01 seconds
- the lip-syncing part took 48.50 seconds
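A minimal sketch of what I mean by pre-calculating: detect the face box for the static image once at startup, cache it, and reuse it for every request. load_cached_face_box is a hypothetical helper of mine, and face_detect only stands in for the detection step in inference.py.

import os
import pickle
import cv2

CACHE_PATH = 'temp/face_box.pkl'

def load_cached_face_box(image_path):
    # hypothetical helper: run the expensive detection once and cache the result
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, 'rb') as fh:
            return pickle.load(fh)
    frame = cv2.imread(image_path)
    boxes = face_detect([frame])  # stand-in for the detection call in inference.py
    with open(CACHE_PATH, 'wb') as fh:
        pickle.dump(boxes, fh)
    return boxes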
After looking into the lip-syncing part I found that the entire lip-synced video is generated first and only afterwards combined with the audio:
for i, (img_batch, mel_batch, frames, coords) in enumerate(tqdm(gen,
        total=int(np.ceil(float(len(mel_chunks)) / batch_size)))):
    if i == 0:
        model = load_model(args.checkpoint_path)
        print("Model loaded")

        frame_h, frame_w = full_frames[0].shape[:-1]
        out = cv2.VideoWriter('temp/result.avi',
                              cv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))

    img_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)
    mel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)

    with torch.no_grad():
        pred = model(mel_batch, img_batch)

    pred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.

    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))

        f[y1:y2, x1:x2] = p
        out.write(f)  # every frame of the whole clip goes into temp/result.avi first

out.release()

# only after the full video is written is the audio muxed in with ffmpeg
command = 'ffmpeg -y -i {} -i {} -strict -2 -q:v 1 {}'.format(args.audio, 'temp/result.avi', args.outfile)
subprocess.call(command, shell=platform.system() != 'Windows')
I then decided to profile each lip-syncing cycle, with the following results:
lypsinc frames generated for batch 0 containing 128 frames with 30.0fps (video part length: 4.26s), took: 3.51s
lypsinc frames generated for batch 1 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.73s
lypsinc frames generated for batch 2 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.76s
...
lypsinc frames generated for batch 53 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.73s
lypsinc frames generated for batch 54 containing 17 frames with 30.0fps (video part length: 0.56s), took: 0.89s
all lypsinc frames generated, took: 48.50s
Conclusion: With face detection solved (or rather put on hold), the lip syncing takes roughly 5 seconds before the first batch of video frames is ready. Each lip-synced video batch is 4.26 seconds long and takes roughly 0.8 seconds to compute. This means that if one were to stream these video batches together with the matching audio frames, it should be possible to have lip syncing that starts rendering after about 5 seconds of delay instead of the roughly 50 seconds in this use case.
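To make that concrete, this is the kind of change I'm imagining (again only an untested sketch): replace the VideoWriter in the loop above with the stdin pipe of the streaming ffmpeg process from the first sketch, so each batch is pushed out as soon as it is predicted.

# untested sketch: 'proc' is the ffmpeg Popen object from the first sketch
for i, (img_batch, mel_batch, frames, coords) in enumerate(gen):
    ...  # model inference as in the loop above, producing pred/frames/coords
    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
        f[y1:y2, x1:x2] = p
        proc.stdin.write(f.tobytes())  # stream this frame immediately instead of out.write(f)

proc.stdin.close()
proc.wait()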