
This question is not really about deep learning itself, but about some of the heavy lifting around it.

While training neural networks, especially when working with high-resolution images, there is a repetitive process of loading images from storage (SSD/HDD) into RAM, from where they are fed to the GPU for training.

There is a lot of time where the GPU is doing all the work while the CPU is mostly idle, so I was wondering: is there a way to load the next batch of images into RAM while the GPU is working? If I'm not mistaken, what happens now is that the CPU loads the images from storage and transfers them to the GPU, the GPU does its thing, and then the GPU has to wait for the CPU to load new images from storage.

How could we code a generator that will retrieve new images to RAM while the GPU is working?

bluesummers
  • would it make sense to offload the GPU work to a different thread so that your main thread can continue? – Kos Nov 20 '17 at 16:24
  • Possible duplicate of [Asynchronously read and process an image in python](https://stackoverflow.com/questions/12474182/asynchronously-read-and-process-an-image-in-python) – timday Nov 20 '17 at 17:26

1 Answer


Okay, assume you have these two tasks:

import time


def cpu_operation(n):
    print('Start CPU', n)
    for x in range(100):
        time.sleep(0.01)
    print('End CPU', n)
    return n


def expensive_gpu_operation(n):
    print('Start GPU', n)
    time.sleep(0.3)
    print('Stop GPU', n)
    return n

Here's how you run them now:

def slow():
    results = []
    for task in range(5):
        cpu_result = cpu_operation(task)
        gpu_result = expensive_gpu_operation(cpu_result)
        results.append(gpu_result)
    return results

We run these in sequence - CPU, GPU, CPU, GPU... Output is like:

Start CPU 0
End CPU 0
Start GPU 0
Stop GPU 0
Start CPU 1
End CPU 1
Start GPU 1
Stop GPU 1
Start CPU 2
End CPU 2
Start GPU 2
Stop GPU 2
Start CPU 3
End CPU 3
Start GPU 3
Stop GPU 3
Start CPU 4
End CPU 4
Start GPU 4
Stop GPU 4

The assumption is that we could save some time by starting CPU task X+1 before GPU task X completes, so that CPU X+1 and GPU X run in parallel, right?

(We can't run CPU X and GPU X in parallel because GPU X needs input from CPU X's output, hence the +1.)

Let's use threads! Basically we want to do something like:

  • start CPU N, wait for it to finish
  • wait for GPU N-1 to finish, start GPU N in background

So we get some parallelism. The simplest way to implement that is a thread pool with one thread - it can act as a queue. In each loop iteration, we'll just schedule a task and store the async_result. When we're done, we'll be able to retrieve all the results.

Incidentally, Python has a thread pool implementation in the multiprocessing module.

from multiprocessing.pool import ThreadPool

def quick():
    pool = ThreadPool(processes=1)
    results = []
    for task in range(5):
        cpu_result = cpu_operation(task)
        # schedule next GPU operation in background,
        # store the async_result instance for this operation
        async_result = pool.apply_async(expensive_gpu_operation, (cpu_result, ))
        results.append(async_result)

    # The results are ready! (Well, the last one probably isn't yet,
    # but get() will wait for it.)
    return [x.get() for x in results]

Now the output becomes:

Start CPU 0
End CPU 0
Start CPU 1
Start GPU 0
Stop GPU 0
End CPU 1
Start CPU 2
Start GPU 1
Stop GPU 1
End CPU 2
Start CPU 3
Start GPU 2
Stop GPU 2
End CPU 3
Start CPU 4
Start GPU 3
Stop GPU 3
End CPU 4
Start GPU 4
Stop GPU 4

We can observe parallelism!
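To get a rough feel for the time saved, here's a self-contained sketch with simulated workloads. The 0.1 s / 0.3 s sleeps are stand-ins for real loading and GPU times, not measurements; since time.sleep releases the GIL, it mimics the I/O and GPU waits where the overlap actually happens:

```python
import time
from multiprocessing.pool import ThreadPool


def cpu_task(n):
    time.sleep(0.1)  # stand-in for loading a batch from disk
    return n


def gpu_task(n):
    time.sleep(0.3)  # stand-in for GPU work (sleep releases the GIL)
    return n


def sequential():
    # CPU, GPU, CPU, GPU... one after the other
    return [gpu_task(cpu_task(n)) for n in range(5)]


def pipelined():
    # schedule each GPU task in the background, overlap with the next CPU task
    pool = ThreadPool(processes=1)
    async_results = [pool.apply_async(gpu_task, (cpu_task(n),)) for n in range(5)]
    return [r.get() for r in async_results]


start = time.time()
seq_out = sequential()
t_seq = time.time() - start    # roughly 5 * (0.1 + 0.3) = 2.0 s

start = time.time()
pipe_out = pipelined()
t_pipe = time.time() - start   # roughly 0.1 + 5 * 0.3 = 1.6 s
```

The pipelined version hides all but the first CPU task behind GPU work, so the saving grows with the per-batch CPU time.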


Note that when expensive_gpu_operation gets scheduled, it doesn't actually start running until the time.sleep inside the next CPU operation. This is due to the Global Interpreter Lock: the main thread has to release the GIL before the worker thread gets a chance to run. Here that happens on time.sleep(); in your case I expect it will happen when you do some I/O, i.e. start reading the next batch of images.
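Since the question specifically asked for a generator, the same idea can also be wrapped the other way around: a background thread prefetches batches into a bounded queue while the main thread consumes them (e.g. feeding the GPU). This is a minimal sketch, and `load_batch` is a hypothetical callable standing in for your image-loading code:

```python
import queue
import threading


def prefetching_generator(load_batch, num_batches, prefetch=1):
    """Yield batches while a background thread loads the next ones.

    `load_batch` is a hypothetical callable taking a batch index;
    `prefetch` bounds how many loaded batches are buffered in RAM.
    """
    q = queue.Queue(maxsize=prefetch)
    sentinel = object()  # unique marker signalling "no more batches"

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks once `prefetch` batches are buffered
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch


# usage: loading of batch i+1 overlaps with consuming batch i
for batch in prefetching_generator(lambda i: [i] * 3, num_batches=4):
    pass  # e.g. train_on_gpu(batch)
```

The bounded queue keeps memory in check: the producer stalls once `prefetch` batches are waiting, instead of loading the whole dataset ahead.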

Kos
  • Please let me know if this approach worked for you! If the thread pool is confusing, we could perhaps start a thread explicitly but it will take more code. – Kos Nov 20 '17 at 17:13
  • This looks pretty straight up, I actually intend to implement a general wrapper for data loading for future deep learning assignments, if anything goes wrong I'll update here – bluesummers Nov 20 '17 at 20:19