Multi process Video Processing

Question

I would like to do video processing on neighboring frames. More specifically, I would like to compute the mean squared error between neighboring frames:

mean_squared_error(prev_frame,frame)

I know how to compute this in a straightforward sequential way: I use the imutils package, which uses a queue to decouple loading the frames from processing them. Because the frames are buffered in a queue, I don't have to wait for them before I can process them. But I want it to be even faster:

# import the necessary packages to read the video
from imutils.video import FileVideoStream
# package to compute the mean squared error
from skimage.metrics import mean_squared_error

if __name__ == '__main__':

    # SPECIFY PATH TO VIDEO FILE
    path_video = "VIDEO_PATH.mp4"

    # START IMUTILS VIDEO STREAM
    print("[INFO] starting video file thread...")
    fvs = FileVideoStream(path_video).start()

    # INITIALIZE LIST to store the results
    mean_square_error_list = []

    # READ PREVIOUS FRAME
    prev_frame = fvs.read()

    # LOOP over frames from the video file stream
    while fvs.more():

        # GRAB THE NEXT FRAME from the threaded video file stream
        frame = fvs.read()

        # COMPUTE the metric
        metric_val = mean_squared_error(prev_frame, frame)
        mean_square_error_list.append(1 - metric_val)  # Append to list

        # UPDATE previous frame variable
        prev_frame = frame

    # STOP the reader thread
    fvs.stop()

Now my question is: how can I multiprocess the computation of the metric to increase speed and save time?

My operating system is Windows 10 and I am using Python 3.8.0.

If you can read all frames into memory first and collect them into a list of frames, you could use multiprocessing or concurrent.futures. — TYZ, Apr 30 '20 at 20:04
@TYZ Very good idea. In principle this would work, but the videos are very long (1h+) and my RAM is quite limited (only 8 GB), so I will run into memory issues very soon. — henry, Apr 30 '20 at 20:06
@TYZ But maybe I could split the video into smaller, more manageable chunks and process one chunk at a time. — henry, Apr 30 '20 at 20:09
@TYZ The question then would be: how do I handle the computation of the metric between the last frame of chunk 1 and the first frame of chunk 2? Do you see what I mean? The brute-force way would be to just calculate those edge cases at the end, but I am wondering if there is a nicer way... — henry, Apr 30 '20 at 20:10
You could give numba a try. You should be able to run multiple frames with `prange`. — gnodab, Apr 30 '20 at 21:01
@gnodab Thank you for your comment. I had not heard about numba before. This definitely sounds interesting. I am just not sure whether it can handle OpenCV. Do you know if this is possible with numba? Many thanks. — henry, May 01 '20 at 10:48

Zabir Al Nazi · Accepted Answer · 2020-05-01T13:33:32.377

There are many aspects to making this faster; I'll focus only on the multiprocessing part.

As you don't want to read the whole video into memory at once, we have to read it frame by frame.

I'll be using OpenCV (cv2) for reading the frames and NumPy for calculating the MSE and saving it to disk.

First, we start without any multiprocessing so that we have a baseline to benchmark against. I'm using a 1920 by 1080 video at 60 FPS, duration 1:29, size 100 MB.

import os
import time

import cv2
import numpy as np

filename = '2.mp4'

def process_video():
    cap = cv2.VideoCapture(filename)

    mse = []
    prev_frame = None
    while True:
        ret, frame = cap.read()  # reading frames sequentially
        if not ret:
            break

        # cast to float first: uint8 subtraction would wrap around
        frame = frame.astype(np.float64)
        if prev_frame is not None:
            c_mse = np.mean(np.square(prev_frame - frame))
            mse.append(c_mse)

        prev_frame = frame

    cap.release()

    # make sure the output directory exists before saving
    os.makedirs('data', exist_ok=True)
    np.save('data/sp.npy', np.array(mse))


if __name__ == "__main__":

    t1 = time.time()

    process_video()

    t2 = time.time()

    print(t2 - t1)

On my system, it runs for 142 seconds.
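One detail worth checking when computing the MSE on frames returned by `cap.read()`: they are `uint8` arrays, and `uint8` arithmetic wraps around modulo 256 instead of going negative or growing past 255. The following is a minimal sketch with two made-up single-pixel "frames" (the values are purely illustrative) that compares the wrapped result with the one obtained after casting to float:

```python
import numpy as np

# Two hypothetical single-pixel "frames" with the dtype used for 8-bit video.
prev_frame = np.array([10], dtype=np.uint8)
frame = np.array([30], dtype=np.uint8)

# Straight uint8 arithmetic: both the subtraction and the square wrap modulo 256.
wrapped = np.mean(np.square(prev_frame - frame))

# Casting to float before subtracting gives the real squared difference.
diff = prev_frame.astype(np.float64) - frame.astype(np.float64)
correct = np.mean(np.square(diff))

print(wrapped, correct)
```

The two numbers differ whenever a pixel changes by 16 or more gray levels, because the squared difference then no longer fits into 8 bits.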

Now, we can take the multiprocessing approach. The idea can be summarized in the following illustration.

GIF credit: Google

We split the video into segments (one per CPU core) and process the frames of those segments in parallel.

import os
import time

import cv2
import numpy as np
import multiprocessing as mp

filename = '2.mp4'

def process_video(group_number):
    cap = cv2.VideoCapture(filename)
    num_processes = mp.cpu_count()
    frame_jump_unit = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // num_processes
    # jump to the first frame of this segment
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_jump_unit * group_number)
    # the last segment also takes the frames left over by the integer division
    is_last = group_number == num_processes - 1
    proc_frames = 0

    mse = []
    prev_frame = None
    while is_last or proc_frames < frame_jump_unit:
        ret, frame = cap.read()
        if not ret:
            break

        # cast to float first: uint8 subtraction would wrap around
        frame = frame.astype(np.float64)
        if prev_frame is not None:
            c_mse = np.mean(np.square(prev_frame - frame))
            mse.append(c_mse)

        prev_frame = frame

        proc_frames += 1

    np.save('data/' + str(group_number) + '.npy', np.array(mse))

    cap.release()


if __name__ == "__main__":

    t1 = time.time()

    num_processes = mp.cpu_count()
    print(f'CPU: {num_processes}')

    # only meta-data
    cap = cv2.VideoCapture(filename)
    frame_jump_unit = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) // num_processes
    cap.release()

    # make sure the output directory exists before the workers save to it
    os.makedirs('data', exist_ok=True)

    # each worker handles one segment and saves its partial result to disk
    with mp.Pool(num_processes) as p:
        p.map(process_video, range(num_processes))

    # merging: concatenate the partial results and add the MSE that is
    # missing at each boundary between two neighboring segments
    cap = cv2.VideoCapture(filename)
    final_mse = []
    for i in range(num_processes):
        na = np.load(f'data/{i}.npy')
        final_mse.extend(na)

        if i == num_processes - 1:
            break  # there is no boundary after the last segment

        # last frame of segment i and first frame of segment i + 1
        frame_no = frame_jump_unit * (i + 1) - 1
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
        ret1, frame1 = cap.read()
        ret2, frame2 = cap.read()
        if not (ret1 and ret2):
            print('failed in 1 case')
            continue
        diff = frame1.astype(np.float64) - frame2.astype(np.float64)
        final_mse.append(np.mean(np.square(diff)))
    cap.release()

    t2 = time.time()

    print(t2 - t1)

    np.save('data/final_mse.npy', np.array(final_mse))

I'm just using numpy's save to store the partial results; you can try something better.

This one runs for 49.56 seconds with cpu_count = 12 on my machine. There are definitely some bottlenecks that can be avoided to make it run faster.

The only issue with my implementation was that it missed the MSE for the frame pairs at the points where the video was segmented, but that is easy to add. Since OpenCV lets us seek to an individual frame by its index, we can go to those locations, calculate the MSE separately, and merge it into the final result. [Check the updated code; it fixes the merging part.]
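An alternative to patching the boundaries afterwards is to let neighboring segments overlap by one frame, so that every adjacent pair falls inside exactly one segment. The sketch below is an illustration under stated assumptions, not the code from this answer: `segment_bounds` and `pairwise_mse` are hypothetical helpers, a small random array stands in for the decoded video, and the number of frames is assumed to be at least the number of segments.

```python
import numpy as np

def segment_bounds(total_frames, num_segments):
    """Split frame indices into (start, stop) ranges, stop exclusive.

    Every segment except the last is extended by one frame, so the pair
    straddling a boundary is computed by the segment on its left.
    """
    step = total_frames // num_segments
    bounds = []
    for i in range(num_segments):
        start = i * step
        # The last segment absorbs any remainder frames.
        stop = total_frames if i == num_segments - 1 else (i + 1) * step + 1
        bounds.append((start, stop))
    return bounds

def pairwise_mse(frames):
    """MSE between each frame and its successor, computed in float64."""
    f = frames.astype(np.float64)
    diffs = f[1:] - f[:-1]
    return np.mean(np.square(diffs), axis=tuple(range(1, f.ndim)))

# Synthetic stand-in for a video: 23 tiny 4x4 grayscale frames.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(23, 4, 4), dtype=np.uint8)

sequential = pairwise_mse(video)
segmented = np.concatenate(
    [pairwise_mse(video[start:stop]) for start, stop in segment_bounds(len(video), 4)]
)
print(np.allclose(sequential, segmented))
```

In a real pipeline each worker would seek to `start` and read `stop - start` frames. Whether the seek lands on exactly the requested frame may depend on the codec and the OpenCV backend, so a comparison against the sequential result is still worthwhile.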

You can write a simple sanity check to ensure that both versions produce the same result.

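import numpy as np

a = np.load('data/sp.npy')
b = np.load('data/final_mse.npy')

# both arrays should have the same length (number of frames - 1)
print(a.shape)
print(b.shape)

print(a[:10])
print(b[:10])

# print every index where the two results differ
for i in range(min(len(a), len(b))):
    if a[i] != b[i]:
        print(i)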

Some additional speedups can come from using a CUDA-compiled OpenCV, ffmpeg, adding a queuing mechanism on top of multiprocessing, etc.
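As a rough sketch of the queuing idea (the same decoupling of loading and processing that the question describes), a reader thread can fill a bounded `queue.Queue` while the main thread computes the metric. Everything here is illustrative: `frame_source` is a hypothetical stand-in for a real decoder, and the frames are synthetic.

```python
import queue
import threading

import numpy as np

SENTINEL = None  # marks the end of the stream

def frame_source(num_frames, shape=(4, 4)):
    """Hypothetical decoder: yields synthetic uint8 frames."""
    rng = np.random.default_rng(0)
    for _ in range(num_frames):
        yield rng.integers(0, 256, size=shape, dtype=np.uint8)

def reader(frames, out_queue):
    """Producer: push frames into the queue, then signal completion."""
    for frame in frames:
        out_queue.put(frame)  # blocks while the queue is full
    out_queue.put(SENTINEL)

def consume(in_queue):
    """Consumer: MSE between each frame and the one before it."""
    mse = []
    prev = None
    while True:
        frame = in_queue.get()
        if frame is SENTINEL:
            break
        cur = frame.astype(np.float64)
        if prev is not None:
            mse.append(float(np.mean(np.square(cur - prev))))
        prev = cur
    return mse

frame_queue = queue.Queue(maxsize=8)  # bounded, so memory use stays small
worker = threading.Thread(
    target=reader, args=(frame_source(10), frame_queue), daemon=True
)
worker.start()
result = consume(frame_queue)
worker.join()
print(len(result))
```

Used inside each worker process, a pattern like this would let decoding and the MSE computation overlap; how much it helps depends on where the actual bottleneck is.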

Many thanks for your very helpful answer!! 1. To add the missing MSE due to segmentation, would you do it in the "merging" part? 2. If I want to introduce queuing, I would do it within the process_video function, right? 3. Could I also save some time if I calculate the frame jump outside of the function and pass it in as a parameter? — henry, May 01 '20 at 06:28
@henry 1. Check the updated answer; I added the merge part. 2. Yes, that would ensure full utilization of each core. 3. You can, but the gain would not be worth it, as you would have to replace `map` with something more complex like `starmap`, or use shared variables. Getting the `frame_count` and computing the `jump` are O(1) operations, so they add no meaningful overhead. — Zabir Al Nazi, May 01 '20 at 13:30
This is very nice ! Thank you very much for your great answer !! — henry, May 01 '20 at 14:52
