
I am trying to take advantage of the video encoding capabilities of my Nvidia GPU by using it, instead of the CPU, to save a stream of numpy arrays to an .mp4 or .avi file. With this I intend to:

  • offload work from my CPU, so it can do other things at the same time
  • potentially speed up the encoding

In order to do that, I have created a sample repository that implements this functionality. The ffmpeg call that uses CUDA looks as follows:

ffmpeg -y -f rawvideo -pix_fmt rgb24 -vsync 0 -extra_hw_frames 2 -s 2000x2000 -r 45 -i - -an -c:v h264_nvenc output.mp4

As you see, ffmpeg receives data from standard input, which is provided by the ffmpeg_gpu_benchmark.py script. However, even with the CUDA encoder flag -c:v h264_nvenc, I observe that ffmpeg and Python still take up a lot of CPU time. Why could that be?

For the sake of completeness, here is the relevant code:

def run_v2(camera, proc, args, pb, data=None):
    try:
        if data is None:
            # grab the next frame from the camera
            data = read_frame(camera=camera, color=args.color, height=args.height, width=args.width)

        # pipe the raw frame bytes to the ffmpeg subprocess
        write_to_ffmpeg(proc, data)
        if pb:
            pb.update(1)
        if args.preview:
            cv2.imshow("frame", cv2.resize(data, (300, 300)))
            if cv2.waitKey(1) == ord("q"):
                raise QuitException
        return 0
    except (KeyboardInterrupt, QuitException):
        stop_camera(camera)
        return 1

where proc is a subprocess.Popen instance created like so:

proc = subprocess.Popen(
    cmd,
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    shell=False
)

and write_to_ffmpeg just runs this:

# data is an np.array with dtype np.uint8
proc.stdin.write(data)

This is true even when I set frame to a constant random frame, created with np.random.randint when the program starts. So the CPU load is not caused by frame-acquisition latency.
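A minimal, self-contained version of this constant-frame test looks roughly like this (frame size, duration, and output path are illustrative, not the exact values from the repository):

import subprocess

import numpy as np

width, height, fps = 2000, 2000, 45
cmd = (
    f"ffmpeg -y -f rawvideo -pix_fmt rgb24 -s {width}x{height} "
    f"-r {fps} -i - -an -c:v h264_nvenc output.mp4"
).split()

proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

# one constant random frame, generated once, so acquisition costs nothing
frame = np.random.randint(0, 256, (height, width, 3), dtype=np.uint8).tobytes()

for _ in range(fps * 10):  # about 10 seconds of video
    proc.stdin.write(frame)

proc.stdin.close()
proc.wait()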

PS: I am doing this because, unfortunately, the CUDA-based VideoWriter class from OpenCV is only supported on Windows, not on Linux.

antortjim
  • A couple observations. (1) you don't need `stdout=subprocess.PIPE` (just sayin') (2) Have you tried to reencode your encoded video with hwaccel ffmpeg? That is, if you remove all Python clutter from your problem, does it still require high CPU utilization? (make sure to use HW decoder as well) (3) The other way around, how busy is it if you comment out `write_to_ffmpeg` line? (4) Can you feed the camera feed straight to ffmpeg and perhaps split its output to stdout? Wonder if that improves anything. – kesh Feb 22 '22 at 20:08
  • 1) Right, I copy-pasted that, but indeed it's not needed! 2) and 3) I think the main CPU usage was driven by the conversion to grayscale, as mentioned by @Rotem 4) I think I can't, because it's a Basler camera. I only know how to operate it from a Python program – antortjim Mar 01 '22 at 14:49
  • Can you set your [camera image format to `Mono8`](https://docs.baslerweb.com/pixel-format)? I bet that further reduces the CPU load, as it only needs to receive grayscale data from USB – kesh Mar 01 '22 at 17:07
  • Neither `mono8` nor `Mono8` seem to work on my side. I get the error `[rawvideo @ 0x559b02365840] No such pixel format: mono8. pipe:: Invalid argument ` – antortjim Mar 01 '22 at 17:16
  • Ah, ok, sorry, you mean on the Basler side! Yes, let me try. Indeed, it is already `Mono8`, according to Pylon Viewer – antortjim Mar 01 '22 at 17:17
  • So it's mono but `camera._next_image()` returns RGB? I wonder if it's actually doing the unnecessary grayscale-to-RGB conversion. – kesh Mar 01 '22 at 17:25
  • One last thing, I was browsing the `pypylon` repo (which isn't what you're using?) and noticed they have `grabResult.GetArrayZeroCopy()`. I was wondering if this retrieval option could help you. – kesh Mar 01 '22 at 17:27
  • I am using [this](https://github.com/shaliulab/baslerpi), a Python library we have in the lab to interact with Basler cameras, abstracting away the pypylon logic. But indeed, it is based on pypylon. I wrote it some time ago and I know it's not clean, sorry about that! You can follow by checking `camera._next_image()`, which calls `_next_image_raw()`, which calls `grab()`, and eventually in there I do: `grabResult = RetrieveResult(timeout, pylon.TimeoutHandling_ThrowException)` – antortjim Mar 01 '22 at 17:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242523/discussion-between-antortjim-and-kesh). – antortjim Mar 01 '22 at 17:43
  • It appears that you can set it from Python: `camera.PixelFormat = "Mono8"`. – kesh Mar 01 '22 at 18:01
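A minimal pypylon sketch of that suggestion (assuming a plain pypylon InstantCamera rather than the lab's wrapper library):

from pypylon import pylon

# open the first available Basler camera and request grayscale frames,
# so no mono-to-RGB conversion is needed downstream
camera = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
camera.Open()
camera.PixelFormat.SetValue("Mono8")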

1 Answer


When executing your code sample, I could identify two issues:

  • frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR) has high CPU utilization.
  • The frame rate of the camera simulation is not 45fps.

Since I don't have the Basler camera, my answer addresses only the simulated camera.
As kesh commented, you may apply hardware acceleration for the camera video decoding (in case the camera stream is encoded).


To avoid the OpenCV color conversion, we may create the RGB frame up front, instead of converting each frame from grayscale to RGB.

Replace: frame = np.random.randint(0, 256, (height, width, 1), dtype=np.uint8)
With: frame = np.random.randint(0, 256, (height, width, 3), dtype=np.uint8)

And replace

if color:
    frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
    

with:

if not color:
    frame = frame[:, :, 0]

On my machine, it reduces the CPU utilization from 55% to about 30%.
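Putting the two changes together, a sketch of the modified frame simulation (the function name and signature are illustrative, not the exact ones from the repository):

import numpy as np

def simulate_frame(color, height, width):
    # generate RGB noise directly, instead of grayscale noise + cv2.cvtColor
    frame = np.random.randint(0, 256, (height, width, 3), dtype=np.uint8)
    if not color:
        frame = frame[:, :, 0]  # cheap slicing instead of a color conversion
    return frame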

I also tried creating frames in NV12 format up front, instead of RGB, using the code from my following post, but there was almost no effect.


The other issue is that the code is not simulating a 45fps input source.

The default FFmpeg behavior is to encode frames as fast as possible.
Because there is no real camera, there is nothing that limits the input framerate to 45fps.

To simulate a 45fps input framerate, add the -re argument:

command = f"ffmpeg -y -re ...

FFmpeg's "-re" flag means "Read input at native frame rate. Mainly used to simulate a grab device." I.e., if you wanted to stream a video file, you would want to use this; otherwise FFmpeg might consume it too fast (it attempts to read at line speed by default).
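For reference, a sketch of the full command from the question with -re added (-re is an input option, so it must appear before -i):

ffmpeg -y -re -f rawvideo -pix_fmt rgb24 -vsync 0 -extra_hw_frames 2 -s 2000x2000 -r 45 -i - -an -c:v h264_nvenc output.mp4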

Adding -re reduced the CPU utilization to about 20%.


After adding -re, using NV12 input format instead of RGB reduced the CPU utilization to about 9%.

To simulate a random NV12 frame, we may create a frame with height*3//2 rows (NV12 stores a full-resolution luma plane followed by interleaved, half-resolution chroma planes):

frame = np.random.randint(0, 256, (height*3//2, width), dtype=np.uint8)
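Note that when piping NV12 frames, the rawvideo demuxer must be told about the new pixel format as well; a sketch of the adjusted command:

ffmpeg -y -re -f rawvideo -pix_fmt nv12 -s 2000x2000 -r 45 -i - -an -c:v h264_nvenc output.mp4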
Rotem
  • You were absolutely right, I didn't think about the CPU consumption that the grayscale conversion would cause. Removing it lowers it significantly. I tried the `-re` flag and found it works best without it, setting the framerate with `-r` instead. I have updated the repo and documented it: https://github.com/shaliulab/ffmpeg-gpu-benchmark – antortjim Mar 01 '22 at 14:47