Let's say I have two neural networks represented as Python classes A and B. The methods A.run() and B.run() each perform a feedforward inference for one image. As an example, A.run() takes 100 ms and B.run() takes 50 ms. When they are run one after another, i.e.
import time
import cv2

cap = cv2.VideoCapture(0)               # video source, e.g. a webcam
img = cap.read()[1]                     # grab one frame
start_time = time.time()
A.run(img)                              # 100 ms
B.run(img)                              # 50 ms
time_diff = time.time() - start_time    # 100 + 50 = 150 ms
the inference times just add up to 150 ms.
To make this faster, we can try parallelizing them so that they start at the same time. An implementation that uses Python's threading is outlined below:
class A:
    # This method is spawned in a separate thread using Python's threading library
    def run_queue(self, input_queue, output_queue):
        while True:
            img = input_queue.get()
            start_time = time.time()
            output = self.run(img)
            time_diff = time.time() - start_time  # Supposedly 100 ms for class A, and 50 ms for class B
            output_queue.put(output)
# In the main program flow:
# Assume that a_input_queue and a_output_queue are tied to an instance of class A,
# and similarly for class B.
img = cap.read()[1]
a_input_queue.put(img)
b_input_queue.put(img)
start_time = time.time()
a_output = a_output_queue.get()  # Should take 100 ms
b_output = b_output_queue.get()  # B.run() should take 50 ms, but since it started at the same
                                 # time as A.run(), this get() should effectively return immediately
time_diff = time.time() - start_time  # Should theoretically be 100 ms
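For context, here is a minimal, self-contained sketch of how the queues and threads are wired up (DummyNet and its sleep-based run() are stand-ins for the real networks, not my actual classes):

import queue
import threading
import time

class DummyNet:
    # Stand-in for A or B: run() just sleeps for a fixed latency
    def __init__(self, name, latency_s):
        self.name = name
        self.latency_s = latency_s

    def run(self, img):
        time.sleep(self.latency_s)  # pretend this is the feedforward pass
        return self.name

    def run_queue(self, input_queue, output_queue):
        while True:
            img = input_queue.get()
            start_time = time.time()
            output = self.run(img)
            print(self.name, 'run() took', round(time.time() - start_time, 3), 's')
            output_queue.put(output)

a, b = DummyNet('A', 0.100), DummyNet('B', 0.050)
a_input_queue, a_output_queue = queue.Queue(), queue.Queue()
b_input_queue, b_output_queue = queue.Queue(), queue.Queue()
threading.Thread(target=a.run_queue, args=(a_input_queue, a_output_queue), daemon=True).start()
threading.Thread(target=b.run_queue, args=(b_input_queue, b_output_queue), daemon=True).start()

img = None  # in the real code this is cap.read()[1]
a_input_queue.put(img)
b_input_queue.put(img)
start_time = time.time()
a_output = a_output_queue.get()
b_output = b_output_queue.get()
print('total:', round(time.time() - start_time, 3), 's')  # ~0.10 s with the sleep stand-ins

With the sleep stand-ins this does print roughly 100 ms total, which is the behaviour I expected from the real networks.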
So theoretically, we should only be bottlenecked by A, and end up with 100 ms for the whole system.
However, it seems that B.run() also takes around 100 ms when measured inside B.run_queue(). Since the two started at around the same time, the whole system still takes around 100 ms.
Does this make sense? Is it sensible to thread the two neural networks if the resulting total inference time is about the same (or at best only marginally faster)?
My guess is that the GPU is already maxed out at 100% by a single neural network, so when it runs inference for two networks at the same time, it just rearranges the instructions but can only perform the same number of computations anyway:
Illustration:
A.run() executes 8 blocks of instructions:
| X | X | X | X | X | X | X | X |
B.run() executes only 4 blocks of instructions:
| Y | Y | Y | Y |
Now, say that the GPU can process 2 blocks of instructions per second.
So, in the case that A.run() and B.run() are run one after the other (non-threaded):
| X | X | X | X | X | X | X | X | Y | Y | Y | Y | -> A.run() takes 4 s, B.run() takes 2 s, everything takes 6 s
In the threaded case, the instructions are rearranged so both start at the same time, but get stretched out:
| X | X | Y | X | X | Y | X | X | Y | X | X | Y | -> A.run() roughly takes 6 s, B.run() roughly takes 6 s, everything seems to take 6 s
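To make the block counting concrete, here is a tiny toy simulation of the two schedules (this is just the arithmetic of the illustration above, not a model of a real GPU scheduler):

def simulate(schedule, rate=2.0):
    # schedule is a list of block owners, e.g. ['X', 'X', 'Y', ...];
    # rate is how many blocks the GPU retires per second.
    finish_times = {}
    for position, owner in enumerate(schedule, start=1):
        finish_times[owner] = position / rate  # overwritten until the owner's last block
    return finish_times

sequential  = ['X'] * 8 + ['Y'] * 4      # A first, then B
interleaved = ['X', 'X', 'Y'] * 4        # both "started at the same time"

print(simulate(sequential))   # {'X': 4.0, 'Y': 6.0}  -> A: 4 s, B: 2 s of work but done at t=6 s
print(simulate(interleaved))  # {'X': 5.5, 'Y': 6.0}  -> both appear to take ~6 s

Both schedules finish all 12 blocks at t = 6 s; the only thing the interleaving changes is that each individual run() now appears to span almost the whole 6 s.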
Is the above illustration what is actually happening?
Finally, let's consider a class C similar to B (e.g. an inference time of 50 ms), except that it runs on the CPU. It therefore shouldn't compete with A for GPU usage, yet in my experiments it behaved just like B: its inference time seemed to get stretched to match A's.
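To probe the class-C behaviour in isolation, below is a minimal, framework-free sketch that times a CPU-bound stand-in for C.run() alone and then next to another busy Python thread. Note the assumptions: cpu_work and its iteration count are hypothetical, and the second worker is just another Python thread rather than the real GPU-bound A, so this only exercises the threading/CPU side of the question.

import threading
import time

def cpu_work(iterations=5_000_000):
    # Pure-Python busy loop standing in for C.run() on the CPU (hypothetical workload)
    total = 0
    for i in range(iterations):
        total += i
    return total

def timed(name, fn):
    start_time = time.time()
    fn()
    print(name, 'took', round(time.time() - start_time, 3), 's')

# Baseline: run the CPU stand-in alone
timed('C alone', cpu_work)

# Now run it next to a second busy thread and see whether its time gets stretched
t = threading.Thread(target=timed, args=('other worker', cpu_work))
t.start()
timed('C threaded', cpu_work)
t.join()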
Thoughts? Thanks in advance.