Unrolling a trivially parallelizable for loop in python with CUDA

Question

I have a for loop in Python that I want to unroll onto a GPU. I imagine there has to be a simple solution, but I haven't found one yet.

Our function loops over the elements of a NumPy array, does some math on each one, and stores the result in another NumPy array. Each iteration adds something to this result array. A heavily simplified version of our code might look like this:

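import numpy as np

a = np.arange(100)
out = np.zeros(2)  # float array, so the a[x] / 2.0 terms are not truncated to integers
for x in range(a.shape[0]):
    out[0] += a[x]        # running sum of the elements
    out[1] += a[x] / 2.0  # running sum of half of each element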

How can I unroll a loop like this in Python to run on a GPU?

How do I unroll a loop onto the GPU in Python? What libraries should I use, and what function calls? — deltap, Apr 04 '14 at 05:53

score 2 · Answer 1 · edited May 23 '17 at 12:24

The place to start is the PyCUDA documentation at http://documen.tician.de/pycuda/. The example given there is:

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

# Compare the GPU result with NumPy's; the differences should be zero
print(dest - a * b)

You place the part of the code you want to parallelize in a C code segment and call it from Python.

For your example, the data will need to be much bigger than 100 elements to make it worthwhile. You'll also need some way to divide your data into blocks. If you wanted to add 1,000,000 numbers, you could divide them into 1000 blocks, add up each block in the parallelized code, and then add the partial results in Python.
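
The block-wise scheme can be sketched on the CPU with NumPy alone, which may help before writing any kernel. This is only an illustration of the decomposition, not PyCUDA code: the names `n_blocks`, `blocks`, and `partial_sums` are made up for this example, and the per-block sum stands in for the work that each GPU block would do in parallel.

```python
import numpy as np

n_blocks = 1000  # hypothetical block count, taken from the 1,000,000 / 1000 example
a = np.arange(1000000, dtype=np.float64)

# Split the data into 1000 blocks of 1000 elements each.
blocks = a.reshape(n_blocks, -1)

# Step 1 (the part that would run in parallel on the GPU):
# every block produces its own partial sum, independently of the others.
partial_sums = blocks.sum(axis=1)

# Step 2 (done in Python on the host): combine the 1000 partial results.
total = partial_sums.sum()

print(total)
```

On a real GPU, step 1 would be a kernel that writes one value per block into an output array, and only that small array would be copied back to the host.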

Adding things up is not really a natural task for this type of parallelisation. GPUs tend to do the same task for each pixel (or array element) independently, whereas your task needs to combine values from multiple pixels.
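
One way to see the distinction is to split the loop from the question into an element-wise part and a combining part. The sketch below uses plain NumPy on the CPU; the names `terms_full` and `terms_half` are invented for illustration. The element-wise part is the kind of work that maps directly onto one GPU thread per element, while the final sums are the reduction that needs extra care.

```python
import numpy as np

a = np.arange(100)

# Element-wise part: each term depends on a single a[x] only,
# so every element could be handled by its own thread.
terms_full = a.astype(np.float64)
terms_half = a / 2.0

# Reduction part: each output value depends on all of the terms.
out = np.array([terms_full.sum(), terms_half.sum()])

print(out)
```

For this small input the NumPy version is likely all that is needed; the split only starts to matter once the array is large enough to justify moving it to the GPU.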

It might be better to work with CUDA directly first. A related thread is "Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)".
