
Below is a simple function called job() that performs several CuPy tasks on the GPU.

How do I instruct CuPy to run job() a million times concurrently and then aggregate the results?

The intent of my question is to understand how to submit multiple concurrent jobs to one GPU via CuPy.

Test Script

import numpy as np
import cupy as cp

def job( nsamples ):
    # Do some CuPy tasks in GPU
    d_a = cp.random.randn( nsamples )
    d_b = cp.random.randint( -3, high=3, size=nsamples )
    d_result = ( d_a + d_b )
    d_hist, _ = cp.histogram( d_result, bins=cp.array([-3,-2,-1,0,1,2,3,4]) )
    std = cp.std( d_hist )
    return std

# Perform 1 job on the GPU
nsamples = 10 #can be as large as tens to hundreds of thousands
std = job( nsamples )
print( 'std', std, type(std) )

Update:

# Create CUDA streams
d_streams = []
for i in range(0, 10):
    d_streams.append( cp.cuda.stream.Stream( non_blocking=True ) )

# Perform concurrent jobs via CUDA streams.
results = []
for stream in d_streams:
    with stream:
        results.append( job( nsamples ) )
print( 'results', results, len(results), type(results) )

After reading this NVIDIA developer blog on CUDA streams, this CuPy issue ("Support CUDA stream with stream memory pool") and this SO question on CuPy concurrency, I have tried the above, which seems to work. However, I don't know how to tell whether the jobs are running concurrently or serially.

Questions:

  1. How do I profile CuPy's execution of the jobs on the GPU to verify that my script is doing what I want? Ans: nvprof --print-gpu-trace python filename.py (see also the timing sketch after this list)

  2. Is there a limit on the number of streams that I can issue (e.g. limited by some hardware) or is it "infinite"?
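
A rough way to check point 1 directly from Python, without an external profiler, is to time the serial loop against the multi-stream loop with CUDA events. The sketch below is only an illustration and assumes the job() function and nsamples from the test script above; if the streams actually overlap, the second elapsed time should come out noticeably shorter.

import cupy as cp

start = cp.cuda.Event()
end = cp.cuda.Event()

# Baseline: issue the 10 jobs on the default stream (serialized).
start.record()
serial_results = [ job( nsamples ) for _ in range(10) ]
end.record()
end.synchronize()
print( 'serial ms :', cp.cuda.get_elapsed_time( start, end ) )

# Multi-stream version: issue the same 10 jobs, one per stream.
d_streams = [ cp.cuda.stream.Stream( non_blocking=True ) for _ in range(10) ]
start.record()
stream_results = []
for stream in d_streams:
    with stream:
        stream_results.append( job( nsamples ) )
for stream in d_streams:
    stream.synchronize()   # wait for all streams before reading the results
end.record()
end.synchronize()
print( 'streams ms:', cp.cuda.get_elapsed_time( start, end ) )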

Sun Bear

1 Answer


My recommendation in general would be to concatenate all your data together (across jobs) and seek to complete the work in a data parallel way. Here's a rough example:

$ cat t34.py
import numpy as np
import cupy as cp

def job( nsamples, njobs ):
    # Do some CuPy tasks in GPU
    d_a = cp.random.randn( nsamples, njobs )
    d_b = cp.random.randint( -3, high=3, size=(nsamples, njobs) )
    d_result = ( d_a + d_b )
    mybins = cp.array([-3,-2,-1,0,1,2,3,4])
    d_hist = cp.zeros((njobs,mybins.shape[0]-1))
    for i in range(njobs):
      d_hist[i,:], _ = cp.histogram( d_result[i,:], bins=mybins )
    std = cp.std( d_hist, axis=1 )
    return std

nsamples = 10 #can be as large as tens to hundreds of thousands
std = job( nsamples, 2 )
print( 'std', std, type(std) )
$ python t34.py
std [0.69985421 0.45175395] <class 'cupy.core.core.ndarray'>
$

For most of the operations in job we can perform the appropriate cupy operation to take care of the work for all the jobs. To pick one example, the std function readily extends to perform its work across all the jobs. histogram is the exception, as that routine in numpy or cupy does not allow for a partitioned/segmented algorithm, as far as I can see. So I have used a loop for that. If this were the actual work you wanted to do, it might be possible to write a partitioned histogram cupy routine as a cupy kernel. Another alternative would be to issue just the cupy histogram in streams.
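
As a rough illustration of that last alternative (not part of the original answer), the sketch below keeps everything else batched and issues only the per-job histogram calls on separate streams. It assumes d_result is arranged with one job per row, i.e. shape (njobs, nsamples).

import cupy as cp

def histograms_in_streams( d_result, mybins ):
    # d_result is assumed to hold one job per row, i.e. shape (njobs, nsamples)
    njobs = d_result.shape[0]
    d_hist = cp.zeros( (njobs, mybins.shape[0]-1) )
    streams = [ cp.cuda.Stream( non_blocking=True ) for _ in range(njobs) ]
    for i, stream in enumerate( streams ):
        with stream:
            # only this job's histogram call is issued on its own stream
            d_hist[i,:], _ = cp.histogram( d_result[i,:], bins=mybins )
    for stream in streams:
        stream.synchronize()   # wait for all streams before using d_hist
    return d_hist

Whether these calls actually overlap depends on the GPU and on how CuPy's memory pool handles the streams, so it is worth verifying with nvprof --print-gpu-trace as mentioned in the question.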

Robert Crovella
  • Wow! This is awesome. So by simply adding another axis or dimension to represent `njobs` in each `ndarray` object, I can compactly express what I need to do and better leverage the power of the `ndarray` object. I need time to digest the 2nd part of your explanation on the histogram. Is there a limit to the size of each axis? How do I determine that? I do need to get a histogram on one of the axes of the results. Can you point me to an example of how to write a partitioned histogram cupy routine as a cupy kernel or issue the cupy histogram in streams? I would like to learn that. Thanks. – Sun Bear Sep 03 '20 at 16:55
  • I'm not aware of general limits on axis sizes. I would generally start with the memory on your GPU as defining the upper bound of the array(s). I don't know of a tutorial on writing a cupy kernel histogram routine. I would try to learn these things: 1. How to write a histogram in CUDA, such as [here](https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/) 2. How to write a segmented algorithm in CUDA 3. How to write a [cupy kernel](https://docs.cupy.dev/en/stable/tutorial/kernel.html) and then combine all that knowledge. – Robert Crovella Sep 03 '20 at 18:10
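
Combining those three pieces, a partitioned histogram might look roughly like the sketch below, written as a CuPy RawKernel: one block per job, shared-memory bin counters, and a direct bin-index computation that assumes the uniform integer bin edges [-3, ..., 4] used above, with float64, C-contiguous input arranged one job per row. This is only an outline of the idea, not a drop-in replacement for cp.histogram.

import cupy as cp

seg_hist_kernel = cp.RawKernel(r'''
extern "C" __global__
void seg_hist(const double* data, const int nsamples, const int nbins,
              const double lo, const double hi, unsigned int* hist) {
    extern __shared__ unsigned int s_hist[];
    const int job = blockIdx.x;                       /* one block per job */
    for (int b = threadIdx.x; b < nbins; b += blockDim.x) s_hist[b] = 0;
    __syncthreads();
    const double* d = data + (size_t)job * nsamples;  /* this job's row */
    const double width = (hi - lo) / nbins;           /* uniform bins assumed */
    for (int i = threadIdx.x; i < nsamples; i += blockDim.x) {
        double v = d[i];
        if (v >= lo && v <= hi) {
            int b = (int)((v - lo) / width);
            if (b == nbins) b = nbins - 1;             /* right edge goes into the last bin */
            atomicAdd(&s_hist[b], 1u);
        }
    }
    __syncthreads();
    for (int b = threadIdx.x; b < nbins; b += blockDim.x)
        atomicAdd(&hist[(size_t)job * nbins + b], s_hist[b]);
}
''', 'seg_hist')

def segmented_histogram( d_result, lo=-3.0, hi=4.0, nbins=7 ):
    # d_result is assumed to be shaped (njobs, nsamples), float64, C-contiguous
    njobs, nsamples = d_result.shape
    d_hist = cp.zeros( (njobs, nbins), dtype=cp.uint32 )
    seg_hist_kernel( (njobs,), (256,),
                     ( d_result, cp.int32(nsamples), cp.int32(nbins),
                       cp.float64(lo), cp.float64(hi), d_hist ),
                     shared_mem=nbins*4 )
    return d_hist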