
In my Python application I am using Detectron2 to run prediction on an image and detect the key-points of all the humans in the image.

I want to run the prediction on frames that are streamed to my app live (using aiortc), but I discovered that the prediction time is much worse, because it now runs on a new thread (the main thread is occupied with the server).

Running a prediction on the thread takes anywhere between 1.5 and 4 seconds, which is a lot.

When running the predictions on the main thread (without the video streaming part), I get prediction times of less than a second.

My question is: why does this happen, and how can I fix it? Why is the GPU performance degraded so drastically when it is used from a new thread?

Notes:

  1. The code is tested in Google Colab with a Tesla P100 GPU, and the video streaming is emulated by reading frames from a video file.

  2. I calculate the time it takes to run prediction on a frame using the `CodeTimer` class shown in the example code below.

I tried switching to multiprocessing instead, but couldn't make it work with CUDA (I tried both `import multiprocessing` and `import torch.multiprocessing` with `set_start_method('spawn')`); it just gets stuck when calling `start` on the process.

Example code:

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

import threading
import time    # used by send_frames below
import random  # used by send_frames below
from typing import List
import numpy as np
import timeit
import cv2

# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")

cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)


def get_frames(video: cv2.VideoCapture):
    frames = list()
    while True:
        has_frame, frame = video.read()
        if not has_frame:
            break
        frames.append(frame)
    return frames

class CodeTimer:
    # Source: https://stackoverflow.com/a/52749808/9977758
    def __init__(self, name=None):
        self.name = " '" + name + "'" if name else ''

    def __enter__(self):
        self.start = timeit.default_timer()

    def __exit__(self, exc_type, exc_value, traceback):
        self.took = (timeit.default_timer() - self.start) * 1000.0
        print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')

video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()

def send_frames():
    # This function emulates the stream, so here we "get" a frame and add it to our buffer
    for frame in get_frames(video):
        frames_buffer.append(frame)
        # Simulate delays between frames
        time.sleep(random.uniform(0.3, 2.1))

def predict_frames():
    predicted_frames = 0  # The number of frames predicted so far
    while predicted_frames < num_frames:  # Stop after we predicted all frames
        buffer_length = len(frames_buffer)
        if buffer_length <= predicted_frames:
            continue  # Wait until we get a new frame

        # Read all the frames from the point we stopped
        for frame in frames_buffer[predicted_frames:]:
            # Measure the prediction time
            with CodeTimer('In stream prediction'):
                predictions.append(predictor(frame))
            predicted_frames += 1


t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()
SagiZiv
  • I have three questions/suggestions: 1. I do not understand how you use the threads, because it looks like you currently have one thread that runs both the detection and the `get_frames` function. It would make more sense to me to have one thread that fills a buffer with images, and another thread that processes the images. – Thijs Ruigrok Oct 15 '21 at 17:19
  • 2. Can you check whether the detection model is fully initialized before you run it in a thread? Usually the detection model needs a longer time (a few seconds) to process the first frame. You can try letting the model process a dummy frame/empty image directly after initializing it (after this line: `predictor = DefaultPredictor(cfg)`); a warm-up sketch follows these comments. 3. Can you check that the detection model is run on the GPU? I do not see code that moves your model or your image to the GPU; maybe this is done within the `DefaultPredictor`, but I cannot tell for sure. – Thijs Ruigrok Oct 15 '21 at 17:20
  • @ThijsRuigrok 1. You are right, I have just now noticed that I oversimplified my example code; it is supposed to send the frames on another thread. 2. I tried that, and it seems that it is indeed initialized but still runs slowly. 3. In the `cfg` I specify that the predictor runs on `cuda`, and the `DefaultPredictor` moves the frame to the GPU. – SagiZiv Oct 16 '21 at 15:54
  • Sounds good. Are you 100% sure that the implementation of the threading in the real code is not causing any problems? Is it possible to share (a part of) the real code? – Thijs Ruigrok Oct 18 '21 at 09:46
  • @ThijsRuigrok Unfortunately this is the most I am allowed to share… I have measured the time it takes to run the prediction both on the main thread and on a separate thread, and as I wrote in the question, on the main thread the predictions run much faster. Therefore I think the cause is the threading. – SagiZiv Oct 18 '21 at 13:24
  • I'm sorry, since the threading in your example is implemented differently from the threading in your code, I cannot help you with your version of the code. – Thijs Ruigrok Oct 19 '21 at 08:56
  • @ThijsRuigrok Thank you, I have updated the code to be more similar to the real code. – SagiZiv Oct 19 '21 at 10:07
  • Thanks for updating the code. Your code seems logical considering the threading part. I notice that you never clear the frame buffer. In the case of a large video/image stream this might soak up a lot of RAM, which can slow down your system or even crash it (happened to me when I loaded a 4-minute video consisting of 7200 frames). – Thijs Ruigrok Oct 19 '21 at 15:24
  • Sadly I could not reproduce your problem since I do not have a GPU currently available. However, I took a look at the detectron2 code and found that the `DefaultPredictor` class is not recommended for more advanced inference operations. I also found a class `AsyncPredictor` at line 132 of https://github.com/facebookresearch/detectron2/blob/cbbc1ce26473cb2a5cc8f58e8ada9ae14cb41052/demo/predictor.py This class uses a task queue to asynchronously process images, which might be a more efficient implementation for solving your problem. Let me know what you think. – Thijs Ruigrok Oct 19 '21 at 15:27
  • @ThijsRuigrok Thanks, it seems that the `AsyncPredictor` is in the demo package, so it's not as easy to import. I am checking if I even have that package when I install detectron2. – SagiZiv Oct 19 '21 at 16:11
  • @ThijsRuigrok Well, either I am using it wrong or that is the reason it's only in the demo: my Colab session crashes with the error `Your session crashed after using all available RAM`. It seems to keep loading and loading stuff without printing anything. I also tried to stop the prediction after 10 frames instead of the entire video but got the same results... – SagiZiv Oct 19 '21 at 16:36
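
To make the warm-up suggestion from the comments concrete, here is a minimal sketch (assumed code, not part of the original post) that runs a dummy frame through the predictor right after it is created, so the first real frame does not pay the one-time initialization cost:

# Hypothetical warm-up step, placed directly after `predictor = DefaultPredictor(cfg)`.
# The dummy frame size (720x1280, BGR) is arbitrary; any uint8 image-shaped array works.
dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
with CodeTimer('Warm-up prediction'):
    predictor(dummy_frame)  # the first call pays CUDA context / model initialization costs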

4 Answers


Python threads rely on the GIL, which must be held by any C binding that accesses Python objects. GPU computing libraries typically use C bindings and could potentially hold the GIL from time to time, pausing Python code execution.

It is a wild guess, but it is possible that the predictor function, which has to go through C bindings and acquire the GIL, finds itself waiting for the other thread that is writing the video buffer. Then, depending on how the computation is broken down and how Python schedules your other thread, the impact on performance may become visible.

You may:

  • avoid multi-threading by performing the reading and the prediction in the same thread.
  • use multiprocessing so that the GIL does not interfere between the two processes (see the sketch after this list).
  • code this in a native language such as C or C++.
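
A minimal sketch of the multiprocessing idea, under the assumption that the hang on `start()` comes from CUDA being initialized in the parent process: the predictor is built entirely inside the worker process (spawned with the `spawn` start method), and frames and results are exchanged through queues. The names and the worker layout here are illustrative, not the question's real code:

import numpy as np
import torch.multiprocessing as mp


def prediction_worker(frame_queue, result_queue):
    # Build the predictor inside the child process so that CUDA is initialized
    # here and not inherited from the parent.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.DEVICE = "cuda"
    predictor = DefaultPredictor(cfg)

    while True:
        frame = frame_queue.get()
        if frame is None:  # sentinel: no more frames
            break
        # Move the result to the CPU before queueing it so no CUDA tensors
        # need to be shared between processes.
        result_queue.put(predictor(frame)["instances"].to("cpu"))


if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    frame_queue, result_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=prediction_worker, args=(frame_queue, result_queue))
    worker.start()

    # The main process feeds frames (here a dummy one) and collects predictions.
    frame_queue.put(np.zeros((720, 1280, 3), dtype=np.uint8))
    print(result_queue.get())
    frame_queue.put(None)
    worker.join()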
Victor Paléologue
  • Interesting… And is there a way to overcome it? I tried to use processes instead of threads, but the program simply stops responding for an unknown reason. – SagiZiv Oct 19 '21 at 14:53
  • The multi-process solution seems legit, but I cannot say why it does not work for you. The alternative would be to do everything from the main thread, but then your framerate will depend on the performance of the predictor. For instance, `get_frames` might drop unread frames when its circular buffer is full, making your system skip frames. Last alternative: don't code this in Python, but in a native language. – Victor Paléologue Oct 19 '21 at 14:59
  • 1
    This answer feels just inaccurate enough to be misleading. Python does use regular OS-level threads, it does not emulate them. The purpose of the GIL is to protect modification of *Python* objects – compiled code ("C binding") and especially GPU code usually does not do so and thus *does not* hold the GIL. Even if the GIL is contended, switching is on the order of 0.005s which should be pretty even across two threads – that's much, much less than what is observed as slowdown in the question. – MisterMiyagi Oct 19 '21 at 15:01
  • Interesting idea to run it on the main thread, but I have the server itself running on that thread (it's my first time building such an application, so sorry if it's unconventional). Changing the programming language means we can't use the Python library we are using right now, and it means discarding what we have done so far in Python. – SagiZiv Oct 19 '21 at 15:03
  • Thanks @MisterMiyagi for the details. I think you are right that most GPU operations do not need the GIL to be locked, but with the lack of details, we cannot exclude it. I'm fixing my point on emulated threads, which I misinterpreted from "loosely based on Java’s threading model" in the official doc. – Victor Paléologue Oct 19 '21 at 15:15
  • - I can't avoid multi-threading because the frames will always come from another thread, and I prefer not to add code to that thread that might slow it down and make it miss some frames. - I tried multiprocessing; it just froze, I got no response from the application. - Code in another language is probably better, but it would require me to change a lot of code and find an equivalent library to do the predictions. – SagiZiv Oct 19 '21 at 17:42

The problem lies in your hardware, in your libraries, or in the differences between your example code and the real code.

I implemented your code on an Nvidia Jetson Xavier. I installed all needed libraries using the following commands:

# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate

#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev 
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl

# torchvision
pip install 'git+https://github.com/pytorch/vision.git@v0.9.0'

# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

# ipython bindings (optional)
pip install ipykernel cloudpickle 

# opencv
pip install opencv-python

After that I run your example script on an example video and received the following output:

Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms

After the first pass, the detector consistently takes around 400 ms to run the detection, which seems about right for a Jetson Xavier. I do not experience the slowdown you described.

I have to note that the Jetson is a specific piece of hardware on which the RAM is shared between the CPU and the GPU, so I do not have to transfer the data from CPU to GPU memory. If your slowdown is caused by that transfer between CPU and GPU memory, I will not experience the problem in my setup.

Thijs Ruigrok
  • This is interesting... I ran this example code both on `Colab Pro` and an `AWS EC2 instance with T4 GPU` and got timings of about 800 to 1200 ms, so it is possible the real code adds to the slowdown, but it is still much slower compared to running the prediction on the main thread (without any other threads), which results in 400 ms on average. Thank you very much for the help – SagiZiv Oct 19 '21 at 20:47
  • I have added code on a new question that is able to reproduce the slowness seen here. https://stackoverflow.com/questions/70967366/multithreading-slower-on-detectron2-inferencing-with-cv2-videocapture – Austin Ulfers Feb 03 '22 at 07:35

Without seeing the full code, here are a few suggestions:

  • You might be running into the overhead of starting new threads every time, so explore the option of a thread pool instead of starting new threads every time (see the sketch after this list).
  • If you are not moving the workload to the GPU, that means it is a CPU-bound task, and Python threads are not the right tool for it. For CPU-intensive tasks you should use https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing
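
A minimal sketch of the thread-pool suggestion (illustrative only; it reuses the `predictor` and `frames_buffer` names from the question and assumes a single long-lived worker thread, since one GPU predictor cannot process many frames truly in parallel anyway):

from concurrent.futures import ThreadPoolExecutor

# One long-lived worker thread instead of spawning a new thread per prediction.
executor = ThreadPoolExecutor(max_workers=1)

def predict_async(frame):
    # submit() returns a Future; the GPU work runs on the pool's worker thread.
    return executor.submit(predictor, frame)

# Usage: submit frames as they arrive, collect the results later.
futures = [predict_async(frame) for frame in frames_buffer]
predictions = [future.result() for future in futures]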
Dmitry Zayats
  • 1) I am creating only 2 threads: one for the video stream and one for the predictions. 2) The frame buffer is on the CPU, but every frame is moved to the GPU by the `predictor` object. – SagiZiv Oct 19 '21 at 14:48
  • And as I wrote in the question, multi-processing doesn't work for some reason – SagiZiv Oct 19 '21 at 14:50

Some operations are I/O bound. For example, each cv2.imread call results in I/O overhead. You can read this article which says : "Not all algorithms can be made parallel and distributed to all cores of a processor — some algorithms are simply single threaded in nature."

This means that parallelism for computer vision algorithms must be applied globally: a single operation (such as `imread`) will not be improved by multithreading. However, you will sometimes gain speed by performing other operations in parallel because they are not limited by I/O or anything else. At that point, you will probably see an overall speedup:

If you run a single `imread`:

  • non-multithreaded: 5 ms = cost of imread
  • multithreaded: 7 ms = cost of multithreading + cost of imread

But if you run operations that can be multithreaded:

  • non-multithreaded: 5 ms + 10 ms = cost of imread + cost of operation
  • multi-threaded: 2 ms + 5 ms + 5 ms = cost of multithreading + cost of imread + cost of parallel operations

(these figures are not real, they are just to illustrate what I mean)

Pommepomme
  • I am using `cv2` to read a video file just as an example, because I can't share the video streaming part. In the real code, I don't have a video file. – SagiZiv Oct 19 '21 at 15:00
  • I know, I just edited the message. My post was only there to explain a bit why your program may be slower with multithreading. There are a ton of functions or operations in your external libraries that can be non-parallel. The `imread` function was also just an example; there are other functions like `imread` which can result in I/O overhead. Unfortunately, it seems pretty hard to define which ones. – Pommepomme Oct 19 '21 at 15:08
  • I don't see how this applies to the scenario shown in the question. Can you please clarify? Doing an I/O bound operation, namely reading the frames, and a compute bound operation, namely image recognition, is precisely what the question scenario already does. Thus, this answer seems to suggest it should be *faster* with multithreading. – MisterMiyagi Oct 19 '21 at 15:08
  • No, my answer only suggests that if you only do operations which are not parallelisable, your program will be slower multi-threaded rather than single-threaded. But if your code also uses parallelisable operations, you will globally gain time as you increase the number of threads, though not necessarily if your operations are not parallelisable. – Pommepomme Oct 19 '21 at 15:12