I recently ported my NumPy code to CuPy, but I only get a speedup factor of about 5x-15x. When I check my GPU usage, it seems low (<1%). I want to optimize the way my code operates to get faster results.

Generally, I want to apply multiple successive CuPy operations to a `cupy.ndarray`.
For example, generating random unit vectors:
```python
import cupy as cp

def randomUniformUnitary(N):
    # N random 3D unit vectors, built from spherical coordinates
    theta = cp.random.rand(N) * 2 * cp.pi
    phi = cp.random.rand(N) * cp.pi
    x = cp.sin(phi) * cp.cos(theta)
    y = cp.sin(phi) * cp.sin(theta)
    z = cp.cos(phi)
    return cp.stack((x, y, z), axis=-1)
```
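For reference, the same computation can be mirrored on the CPU with NumPy, since CuPy deliberately follows the NumPy API; only the module prefix changes. This is a sketch to sanity-check the math (the `_np` suffix in the name is just illustrative), and it confirms every output row is a unit vector:

```python
import numpy as np

def random_uniform_unitary_np(N):
    # CPU mirror of the CuPy function above: np instead of cp, same calls
    theta = np.random.rand(N) * 2 * np.pi
    phi = np.random.rand(N) * np.pi
    x = np.sin(phi) * np.cos(theta)
    y = np.sin(phi) * np.sin(theta)
    z = np.cos(phi)
    return np.stack((x, y, z), axis=-1)

vecs = random_uniform_unitary_np(1000)
# sin^2(phi)*(cos^2 + sin^2)(theta) + cos^2(phi) = 1, so each row has unit length
assert np.allclose(np.linalg.norm(vecs, axis=-1), 1.0)
```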
I have several questions that the docs didn't seem to answer (they mention on-the-fly kernel creation, but with no explanation):
- Kernel merging?
Does CuPy create one kernel for `rand()`, send the data back, then create another kernel for the multiplication by `2 * cp.pi`, and so on? Or will all of these calculations be combined into one faster kernel?
- Kernel combination criteria?
If kernels do get combined, what criteria trigger that behavior? Operations written on one line? Operations on the same array? Operations inside the same function? Is it okay, performance-wise, to `def` a separate function containing only one CuPy operation on an array, or is it better to duplicate some code so that all the CuPy calls end up in a single Python function?
- Own kernels?
If each calculation runs as a separate kernel and there is no "kernel merging", then I feel I should probably write my own kernels to optimize. Is that the only way to get real GPU performance?
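To make the first question concrete: CuPy does provide an opt-in fusion decorator, `cupy.fuse`, which compiles a chain of elementwise (and some reduction) operations into a single kernel. A minimal sketch of what I understand it to do (it needs a CUDA-capable GPU, so I haven't benchmarked it here):

```python
import cupy as cp

@cp.fuse()
def polar_to_x(phi, theta):
    # sin, cos, and the multiply are fused into one elementwise kernel
    return cp.sin(phi) * cp.cos(theta)

phi = cp.random.rand(1_000_000) * cp.pi
theta = cp.random.rand(1_000_000) * 2 * cp.pi
x = polar_to_x(phi, theta)  # one fused kernel launch instead of three
```

What I can't tell from the docs is whether anything like this happens automatically outside of `@cp.fuse`.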
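And for the last question, this is the kind of hand-written kernel I mean: CuPy's `ElementwiseKernel` takes a CUDA C snippet and applies it per element. The kernel body below is my own sketch of the `x` component computation (again GPU-only, so untested here):

```python
import cupy as cp

# Hand-written elementwise kernel: computes sin(phi) * cos(theta) in one pass
polar_to_x_kernel = cp.ElementwiseKernel(
    'float64 phi, float64 theta',  # input parameters
    'float64 x',                   # output parameter
    'x = sin(phi) * cos(theta);',  # CUDA C body, run once per element
    'polar_to_x_kernel')           # kernel name

phi = cp.random.rand(1_000_000) * cp.pi
theta = cp.random.rand(1_000_000) * 2 * cp.pi
x = polar_to_x_kernel(phi, theta)
```

Is writing kernels like this the expected path once you outgrow the NumPy-style API, or is it usually unnecessary?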