Below is a simple function called job()
that performs several CuPy tasks on the GPU.
How do I instruct CuPy to run job()
a million times concurrently and then aggregate their results?
The intent of my question is to understand how to submit multiple concurrent jobs to one GPU via CuPy.
Test Script
import numpy as np
import cupy as cp
def job( nsamples ):
    # Do some CuPy tasks on the GPU
    d_a = cp.random.randn( nsamples )
    d_b = cp.random.randint( -3, high=3, size=nsamples )
    d_result = d_a + d_b
    d_hist, _ = cp.histogram( d_result, bins=cp.array( [-3, -2, -1, 0, 1, 2, 3, 4] ) )
    std = cp.std( d_hist )
    return std

# Perform 1 job on the GPU
nsamples = 10  # can be as large as tens to hundreds of thousands
std = job( nsamples )
print( 'std', std, type(std) )
Update:
# Create CUDA streams
d_streams = []
for i in range( 0, 10 ):
    d_streams.append( cp.cuda.stream.Stream( non_blocking=True ) )

# Perform concurrent jobs via CUDA streams.
results = []
for stream in d_streams:
    with stream:
        results.append( job( nsamples ) )
print( 'results', results, len(results), type(results) )
After reading this Nvidia developer blog on CUDA streams, this CuPy issue on Support CUDA stream with stream memory pool, and this SO question on CuPy concurrency, I tried the above, which seems to work. However, I don't know how to verify whether the jobs are running concurrently or serially.
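For completeness, here is a minimal sketch of the aggregation step my question asks about (my own addition, assuming each job() call returns a 0-d CuPy array as above; reducing the per-job results to their mean is just an illustration):

# Wait for all queued work to finish before touching the results.
for stream in d_streams:
    stream.synchronize()

# Aggregate the per-job standard deviations, e.g. into their mean.
d_stds = cp.stack( results )  # one 0-d array per job -> shape (n_jobs,)
print( 'mean of stds', d_stds.mean() )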
Questions:
How do I profile CuPy's execution of the jobs on the GPU to verify that my script is doing what I want? Ans:
nvprof --print-gpu-trace python filename.py
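Note that nvprof is deprecated and does not support GPUs of compute capability 8.0 (Ampere) and newer; there the Nsight Systems CLI produces an equivalent GPU trace:

nsys profile --trace=cuda python filename.py

Independently of a profiler, a rough concurrency check is to time the streamed loop with CUDA events (a sketch, reusing d_streams and nsamples from above) and compare it with a plain serial loop:

start = cp.cuda.Event()
stop = cp.cuda.Event()
start.record()
for stream in d_streams:
    with stream:
        job( nsamples )
# Explicitly wait for every stream before recording the stop event.
for stream in d_streams:
    stream.synchronize()
stop.record()
stop.synchronize()
print( 'streamed elapsed (ms)', cp.cuda.get_elapsed_time( start, stop ) )

If the streamed version is barely faster than the serial one, the kernels may already saturate the GPU on their own, in which case streams have nothing left to overlap.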
Is there a limit on the number of streams that I can create (e.g. limited by some hardware) or is it effectively "infinite"?