The actual question:
Currently OpenCV is used to write the video frames into a single file. Can I append the audio directly as well, or is there some other way to create small video snippets that can then be broadcast via the RTP protocol, or to broadcast them directly from Python code?
out = cv2.VideoWriter(
    'temp/result.avi',
    cv2.VideoWriter_fourcc(*'DIVX'),
    fps,
    (frame_w, frame_h))

...  # some frame manipulation happening

out.write(f)  # f = video frame
I don't want to write the video file and afterwards combine it with audio using ffmpeg.
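To illustrate what I mean, this is roughly the direction I have in mind: pipe the raw frames into an ffmpeg subprocess that also reads the audio file, then encodes and muxes/streams both. This is only an untested sketch; the resolution, fps, paths, codecs and the rtp_mpegts target are placeholders on my side.

import subprocess

# Untested sketch: ffmpeg reads raw BGR frames from stdin, pulls the audio
# from the source file, encodes and streams both together.
fps, frame_w, frame_h = 30, 854, 480  # assumed values matching face.mp4

cmd = [
    'ffmpeg', '-y',
    '-f', 'rawvideo', '-pix_fmt', 'bgr24',
    '-s', f'{frame_w}x{frame_h}', '-r', str(fps), '-i', '-',  # video from stdin
    '-i', 'testdata/audio.mp4',                               # audio from file
    '-c:v', 'libx264', '-preset', 'ultrafast', '-tune', 'zerolatency',
    '-c:a', 'aac',
    '-f', 'rtp_mpegts', 'rtp://127.0.0.1:5004',               # placeholder target
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

# in the frame loop, instead of out.write(f):
#     proc.stdin.write(f.tobytes())
# and when done:
#     proc.stdin.close(); proc.wait()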
Background:
I'm trying to write an application that needs real-time lip syncing. For that purpose I'm experimenting with Wav2Lip. At first this library seems very slow, but it can actually be pretty fast with some optimizations.
Experiments:
I first lip-synced a video with another video file manually using the following command:
python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.mp4 --audio testdata\audio.mp4
The face.mp4 file has a duration of 25 seconds, 30 fps and a resolution of 854x480. The audio.mp4 file has a duration of 260 seconds, 30 fps and a resolution of 480x360.
The total generation took exactly 109 seconds. After dissecting the code and profiling it, I found that there are two parts which take the longest:
- the face detection part took 48.64 seconds
- the lip-syncing part took 48.50 seconds
I then tried it with a static image instead of a video, which cuts down the time significantly (in my use case I will later only use the same face, so I will probably pre-calculate the face detection at startup; see the sketch after the timings below).
python inference.py --checkpoint_path ptmodels\wav2lip.pth --face testdata\face.jpg --audio testdata\audio.mp4
- the face detection part took 1.01 seconds
- the lip-syncing part took 48.50 seconds
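A minimal sketch of what I mean by pre-calculating: detect the face box for the static image once at startup, cache it, and reuse it for every request. load_cached_face_box is a hypothetical helper of mine, and face_detect only stands in for the detection step in inference.py.

import os
import pickle
import cv2

CACHE_PATH = 'temp/face_box.pkl'

def load_cached_face_box(image_path):
    # hypothetical helper: run the expensive detection once and cache the result
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, 'rb') as fh:
            return pickle.load(fh)
    frame = cv2.imread(image_path)
    boxes = face_detect([frame])  # stand-in for the detection call in inference.py
    with open(CACHE_PATH, 'wb') as fh:
        pickle.dump(boxes, fh)
    return boxes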
After looking into the lip-syncing part I found that the entire lip-synced video is generated first and only afterwards combined with the audio:
for i, (img_batch, mel_batch, frames, coords) in enumerate(tqdm(gen,
        total=int(np.ceil(float(len(mel_chunks)) / batch_size)))):
    if i == 0:
        model = load_model(args.checkpoint_path)
        print("Model loaded")

        frame_h, frame_w = full_frames[0].shape[:-1]
        out = cv2.VideoWriter('temp/result.avi',
                              cv2.VideoWriter_fourcc(*'DIVX'), fps, (frame_w, frame_h))

    img_batch = torch.FloatTensor(np.transpose(img_batch, (0, 3, 1, 2))).to(device)
    mel_batch = torch.FloatTensor(np.transpose(mel_batch, (0, 3, 1, 2))).to(device)

    with torch.no_grad():
        pred = model(mel_batch, img_batch)

    pred = pred.cpu().numpy().transpose(0, 2, 3, 1) * 255.

    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))

        f[y1:y2, x1:x2] = p
        out.write(f)  # every frame of the whole clip goes into temp/result.avi first

out.release()

# only after the full video is written is the audio muxed in with ffmpeg
command = 'ffmpeg -y -i {} -i {} -strict -2 -q:v 1 {}'.format(args.audio, 'temp/result.avi', args.outfile)
subprocess.call(command, shell=platform.system() != 'Windows')
I then decided to profile each lip-syncing cycle, with the following results:
lypsinc frames generated for batch 0 containing 128 frames with 30.0fps (video part length: 4.26s), took: 3.51s
lypsinc frames generated for batch 1 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.73s
lypsinc frames generated for batch 2 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.76s
...
lypsinc frames generated for batch 53 containing 128 frames with 30.0fps (video part length: 4.26s), took: 0.73s
lypsinc frames generated for batch 54 containing 17 frames with 30.0fps (video part length: 0.56s), took: 0.89s
all lypsinc frames generated, took: 48.50s
Conclusion: With face detection solved (or rather put on hold), the lip syncing takes roughly 5 seconds before the first batch of video frames is ready. Each lip-synced video batch is 4.26 seconds long and takes roughly 0.8 seconds to compute. This means that if one were to stream these video batches together with the matching audio frames, it should be possible to have lip syncing that starts rendering after about 5 seconds of delay instead of the roughly 50 seconds in this use case.
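To make that concrete, this is the kind of change I'm imagining (again only an untested sketch): replace the VideoWriter in the loop above with the stdin pipe of the streaming ffmpeg process from the first sketch, so each batch is pushed out as soon as it is predicted.

# untested sketch: 'proc' is the ffmpeg Popen object from the first sketch
for i, (img_batch, mel_batch, frames, coords) in enumerate(gen):
    ...  # model inference as in the loop above, producing pred/frames/coords
    for p, f, c in zip(pred, frames, coords):
        y1, y2, x1, x2 = c
        p = cv2.resize(p.astype(np.uint8), (x2 - x1, y2 - y1))
        f[y1:y2, x1:x2] = p
        proc.stdin.write(f.tobytes())  # stream this frame immediately instead of out.write(f)

proc.stdin.close()
proc.wait()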