Okay, assume you have these two tasks:
import time

def cpu_operation(n):
    # Simulates ~1 s of CPU work (100 slices of 10 ms).
    print('Start CPU', n)
    for x in range(100):
        time.sleep(0.01)
    print('End CPU', n)
    return n

def expensive_gpu_operation(n):
    # Simulates a 0.3 s GPU call.
    print('Start GPU', n)
    time.sleep(0.3)
    print('Stop GPU', n)
    return n
Here's how you run them now:
def slow():
    results = []
    for task in range(5):
        cpu_result = cpu_operation(task)
        gpu_result = expensive_gpu_operation(cpu_result)
        results.append(gpu_result)
    return results
We run these in sequence - CPU, GPU, CPU, GPU... The output looks like this:
Start CPU 0
End CPU 0
Start GPU 0
Stop GPU 0
Start CPU 1
End CPU 1
Start GPU 1
Stop GPU 1
Start CPU 2
End CPU 2
Start GPU 2
Stop GPU 2
Start CPU 3
End CPU 3
Start GPU 3
Stop GPU 3
Start CPU 4
End CPU 4
Start GPU 4
Stop GPU 4
The assumption is that we could save some time by starting CPU task X+1 before GPU task X completes, so that CPU X+1 and GPU X run in parallel, right?
(We can't run CPU X and GPU X in parallel because GPU X needs input from CPU X's output, hence the +1.)
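A quick sanity check on the expected savings, using the simulated durations above (my arithmetic, not from the original code): the sequential version costs 5 × (1.0 s + 0.3 s) = 6.5 s, while the pipelined version should take about 5 × 1.0 s + 0.3 s = 5.3 s, since every GPU call except the last one hides behind the next CPU task.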
Let's use threads! Basically we want to do something like:
- start CPU N, wait for it to finish
- wait for GPU N-1 to finish, start GPU N in the background
That way we get some parallelism. The simplest way to implement this is a thread pool with one thread - it can act like a queue. In each loop iteration we'll schedule a task and store its async_result; when we're done, we'll be able to retrieve all the results.
Incidentally, Python has a thread pool implementation in the multiprocessing module.
from multiprocessing.pool import ThreadPool

def quick():
    pool = ThreadPool(processes=1)
    results = []
    for task in range(5):
        cpu_result = cpu_operation(task)
        # Schedule the next GPU operation in the background,
        # and store the async_result instance for this operation.
        async_result = pool.apply_async(expensive_gpu_operation, (cpu_result,))
        results.append(async_result)
    # The results are ready! (Well, the last one probably isn't yet,
    # but get() will wait for it.)
    return [x.get() for x in results]
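To verify the speedup, here's a rough timing harness (my addition, not part of the original snippets); given the simulated durations, we'd expect slow() to take about 6.5 seconds and quick() about 5.3:

import time

if __name__ == '__main__':
    for fn in (slow, quick):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        print(fn.__name__, 'took', round(elapsed, 2), 'seconds')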
Now the output becomes:
Start CPU 0
End CPU 0
Start CPU 1
Start GPU 0
Stop GPU 0
End CPU 1
Start CPU 2
Start GPU 1
Stop GPU 1
End CPU 2
Start CPU 3
Start GPU 2
Stop GPU 2
End CPU 3
Start CPU 4
Start GPU 3
Stop GPU 3
End CPU 4
Start GPU 4
Stop GPU 4
We can observe parallelism!
Note that when expensive_gpu_operation gets scheduled, it doesn't actually start running until the time.sleep() inside the next CPU operation. This is due to the Global Interpreter Lock: the main thread has to release the GIL before the worker thread gets a chance to run anything. Here that happens on time.sleep(); in your case I expect it will happen when you do some I/O - say, start reading the next batch of images.
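As an aside, the same pattern works with the standard library's concurrent.futures.ThreadPoolExecutor, which offers a similar submit()/result() API. A minimal sketch, assuming the same cpu_operation and expensive_gpu_operation as above (quick_futures is my name for it, not from the original):

from concurrent.futures import ThreadPoolExecutor

def quick_futures():
    # One worker thread acts as the GPU queue, like ThreadPool(processes=1).
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = []
        for task in range(5):
            cpu_result = cpu_operation(task)
            # submit() returns a Future; result() below blocks until it's done.
            futures.append(pool.submit(expensive_gpu_operation, cpu_result))
        return [f.result() for f in futures]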