17

I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back.

What I've set up is a GPU class, with functions that perform operations on the GPU (strange, that). These operations are of the style:

for iteration in range(maxval):
    result[iteration]=gpuinstance.gpufunction(arguments,iteration)

I'd imagined that there would be N gpuinstances for N devices, but I don't know enough about multiprocessing to see the simplest way of applying this so that each device is assigned asynchronously. Strangely, few of the examples I came across gave concrete demonstrations of collating results after processing.

Can anyone give me any pointers in this area?

UPDATE Thank you, Kaloyan, for your guidance on the multiprocessing side; if CUDA weren't specifically the sticking point I'd be marking you as answered. Sorry.

Previously to playing with this implementation, the gpuinstance class initiated the CUDA device with `import pycuda.autoinit`, but that didn't appear to work, throwing invalid context errors as soon as each (correctly scoped) thread met a CUDA command. I then tried manual initialisation in the `__init__` constructor of the class with...

pycuda.driver.init()
self.mydev=pycuda.driver.Device(devid) #this is passed at instantiation of class
self.ctx=self.mydev.make_context()
self.ctx.push()    

My assumption here is that the context is preserved between when the list of gpuinstances is created and when the threads use them, so each device is sitting pretty in its own context.

(I also implemented a destructor to take care of pop/detach cleanup)
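For reference, the cleanup path looks roughly like this; `ctx` stands in for the pycuda context object (only `pop()` and `detach()` are assumed from its API), and the class is trimmed down to just the lifecycle bits:

```python
class gpuinstance(object):
    """Trimmed-down sketch showing only the context lifecycle; nothing
    here actually touches CUDA -- ctx is whatever make_context() returned."""
    def __init__(self, ctx):
        self.ctx = ctx  # in the real class: self.mydev.make_context()

    def __del__(self):
        # Undo the push() done at construction, then release the context
        self.ctx.pop()
        self.ctx.detach()
```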

Problem is, invalid context exceptions are still appearing as soon as the thread tries to touch CUDA.

Any ideas, folks? And thanks for getting this far. Automatic upvotes for people working 'banana' into their answer! :P

Bolster
  • 7,460
  • 13
  • 61
  • 96
  • Is the `gpuinstance.gpufunction(arguments,iteration)` asynchronous or does it block execution? – ktdrv May 05 '11 at 22:41

2 Answers

21

You need to get all your bananas lined up on the CUDA side of things first, then think about the best way to get this done in Python [shameless rep whoring, I know].

The CUDA multi-GPU model is pretty straightforward pre-4.0: each GPU has its own context, and each context must be established by a different host thread. So the idea in pseudocode is:

  1. Application starts, and the process uses the API to determine the number of usable GPUs (beware of things like compute mode in Linux)
  2. Application launches a new host thread per GPU, passing a GPU id. Each thread implicitly/explicitly calls the equivalent of cuCtxCreate(), passing the GPU id it has been assigned
  3. Profit!

In Python, this might look something like this:

import threading
from pycuda import driver

class gpuThread(threading.Thread):
    def __init__(self, gpuid):
        threading.Thread.__init__(self)
        self.ctx  = driver.Device(gpuid).make_context()
        self.device = self.ctx.get_device()

    def run(self):
        print("%s has device %s, api version %s"
              % (self.name, self.device.name(), self.ctx.get_api_version()))
        # Profit!

    def join(self):
        self.ctx.detach()
        threading.Thread.join(self)

driver.init()
ngpus = driver.Device.count()
for i in range(ngpus):
    t = gpuThread(i)
    t.start()
    t.join()

This assumes it is safe to just establish a context without any checking of the device beforehand. Ideally you would check the compute mode to make sure it is safe to try, then use an exception handler in case a device is busy. But hopefully this gives the basic idea.
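That checking might be structured something like the following; the pycuda-specific pieces are pulled out into callables (`is_usable` and `make_context` are my names, not pycuda API) so the skeleton itself runs anywhere:

```python
def establish_contexts(ngpus, is_usable, make_context):
    """Create one context per usable GPU, skipping devices that refuse.

    is_usable(devid)    -> False for e.g. compute-prohibited devices
    make_context(devid) -> a context object, or raises RuntimeError if busy
    """
    contexts = {}
    for devid in range(ngpus):
        if not is_usable(devid):
            continue  # compute mode says hands off
        try:
            contexts[devid] = make_context(devid)
        except RuntimeError:
            pass  # device busy (exclusive mode already claimed, say)
    return contexts
```

With pycuda you would pass something like `lambda i: driver.Device(i).make_context()` as `make_context`, and a compute-mode query for `is_usable`.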

talonmies
  • 70,661
  • 34
  • 192
  • 269
  • 1
    @talonmies as always, thanks, but quick query: If I understand this correctly, each thread is 'instantiated', executed, and joined in line. Does this not cause execution to run serially? I assume that the easy fix is to break the `t.join()`s into a separate loop. – Bolster May 06 '11 at 15:54
  • @Andrew Bolter: Yeah, I guess the start methods should be all called in a loop, and the joins all called later. I was wondering a little about the global interpreter lock in that situation too... I must confess I used mpi4py for my python multi-gpu, I have a pthreads framework I use for multi-gpu as well, but usually only with C/C++ and Fortran. – talonmies May 06 '11 at 16:10
  • @Andrew Bolter: I just added a little bit of instrumentation to a modified version of that code I posted and I am beginning to wonder at the sanity of using python threading for this. I would not like to bet on the correctness of what I posted at this point.... – talonmies May 06 '11 at 16:58
  • I suspect I'm going to refactor the problem with an aim to MPI, but it strikes me that this should be more trivial. Also, to circle around the threading deficiencies I've also been looking at multiprocessing instead. – Bolster May 06 '11 at 17:33
  • Also, I don't quite understand your 'pre-4.0' comment, as I understood it the previous context relevant multi-device operation was still supported? – Bolster May 06 '11 at 18:23
  • In cuda 4.0 one thread can hold more than one GPU context, and you just use context selection prior to any operation to use any given GPU. Prior to 4.0 it was 1 host thread per GPU context. The problem here is probably that although a python thread is a pthread, it still relies on the parent thread interpreter, which might not be enough for CUDA thread safety, pre CUDA 4.0 – talonmies May 06 '11 at 18:37
  • That's what I'd read in the PG, but I assume the way around that is declarative selection of devices? (as in your answer) I'll do some experiments. Thanks again. – Bolster May 06 '11 at 19:35
  • @talonmies, after fiddling as discussed, still getting invalid contexts (with or without additional context push/pop-ing) Looking at mpi4py now but would like to understand why this isn't working as imagined. disclaimer: I'm running on 4.0 – Bolster May 07 '11 at 00:50
  • Turns out the driver.init()s have to also be in the run function. ref:http://article.gmane.org/gmane.comp.python.cuda/1539/match=multi+gpu+threading but since this is more or less the behaviour of autoinit, using that for the time being, but what I want to move towards is effectively instantiating 4 gpu objects, that I can call multiple functions on over and over again, if that makes sense? – Bolster May 07 '11 at 01:31
  • It does make sense. The way most people do multi-gpu is to set up persistent threads, each holding its own context, which get sent work as many times as required over the life of the application. That link is multiprocessing based, so it will be different to threading, because there you are running different processes, which don't share an interpreter. – talonmies May 07 '11 at 06:35
  • This wiki link shows how to do it threading - http://wiki.tiker.net/PyCuda/Examples/MultipleThreads and it does seem to work. You can probably use threading mutex and semaphore primitives to feed work to them, and have them call you single gpu work instances as required. – talonmies May 07 '11 at 07:00
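The persistent-worker arrangement described in the last couple of comments can be sketched without any CUDA at all; `do_work` below is a stand-in for the per-GPU kernel call, and comments mark where the real context create/pop/detach would go:

```python
import queue
import threading

def do_work(devid, x):
    # Placeholder for the real per-GPU computation (a kernel launch)
    return x * x

def gpu_worker(devid, tasks, results):
    # In the real thing, create this device's context HERE, in the thread
    # that will use it, e.g. driver.Device(devid).make_context()
    while True:
        item = tasks.get()
        if item is None:       # sentinel: time to shut down
            break
        idx, args = item
        results[idx] = do_work(devid, args)
    # ...and pop/detach the context here before the thread exits

jobs = list(range(8))
results = [None] * len(jobs)
tasks = queue.Queue()
workers = [threading.Thread(target=gpu_worker, args=(d, tasks, results))
           for d in range(2)]
for w in workers:
    w.start()
for i, j in enumerate(jobs):
    tasks.put((i, j))
for _ in workers:
    tasks.put(None)            # one sentinel per worker
for w in workers:
    w.join()
```

The workers stay alive for as many batches as you feed them, which is the "send work over and over to four persistent gpu objects" shape asked for above.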
3

What you need is a multi-threaded implementation of the built-in map function. Here is one implementation. With a little modification to suit your particular needs, you get:

import threading

def cuda_map(args_list, gpu_instances):

    result = [None] * len(args_list)

    def task_wrapper(gpu_instance, task_indices):
        for i in task_indices:
            result[i] = gpu_instance.gpufunction(args_list[i])

    threads = [threading.Thread(
                    target=task_wrapper,
                    args=(gpu_i, list(range(len(args_list)))[i::len(gpu_instances)])
              ) for i, gpu_i in enumerate(gpu_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return result

It is more or less the same as what you have above, with the big difference being that you don't wait for each single gpufunction call to complete before dispatching the next.
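As a quick sanity check of the splitting and collation, here it is driven with dummy gpu instances (no CUDA involved; `cuda_map` is restated so the snippet stands alone, and `gpufunction` just tags its input with a device id):

```python
import threading

def cuda_map(args_list, gpu_instances):
    # Restated from above so this snippet runs standalone
    result = [None] * len(args_list)

    def task_wrapper(gpu_instance, task_indices):
        for i in task_indices:
            result[i] = gpu_instance.gpufunction(args_list[i])

    threads = [threading.Thread(
                    target=task_wrapper,
                    args=(gpu_i, list(range(len(args_list)))[i::len(gpu_instances)])
              ) for i, gpu_i in enumerate(gpu_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

class FakeGPU(object):
    """Dummy stand-in for a gpuinstance; no real GPU work is done."""
    def __init__(self, devid):
        self.devid = devid
    def gpufunction(self, args):
        return (self.devid, args * 2)

out = cuda_map([1, 2, 3, 4, 5], [FakeGPU(0), FakeGPU(1)])
# Results come back in input order; the work was split round-robin:
# [(0, 2), (1, 4), (0, 6), (1, 8), (0, 10)]
```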

ktdrv
  • 3,602
  • 3
  • 30
  • 45
  • Thank you for your comment; it's guided me towards a solution, but it's come up against CUDA-related issues regarding device contexts. Updating question to reflect this now – Bolster May 06 '11 at 01:45