
I recently ported my NumPy code to CuPy, but I only get a speedup factor of about 5-15x. When I check my GPU usage, it seems low (<1%). I want to optimize the way my code operates to get faster results.

Generally, I want to perform multiple successive CuPy operations on a cupy.ndarray. For example, generating random vectors on the unit sphere:

import cupy as cp

def randomUniformUnitary(N):
    # N random spherical angles -> Cartesian coordinates on the unit sphere
    theta = cp.random.rand(N) * 2 * cp.pi
    phi = cp.random.rand(N) * cp.pi
    x = cp.sin(phi) * cp.cos(theta)
    y = cp.sin(phi) * cp.sin(theta)
    z = cp.cos(phi)
    output = cp.stack((x, y, z), axis=-1)
    return output

I have several questions that the docs didn't seem to answer. (They do mention on-the-fly kernel creation, but without explanation.)

  1. Kernel merging?

Does CuPy create a kernel for rand(), send the data back, then create another kernel for the multiplication, and so on? Or are all these calculations combined into one faster kernel?

2. Kernel combination criteria?

If so, what criteria lead to such behavior? One-line operations? Operations on the same array? Function calls? Is it okay performance-wise to define separate functions each containing a single CuPy operation on an array, or is it better to duplicate code in some places and gather all the CuPy calls into one Python function?

  3. Own kernels?

If each calculation is done separately and there is no "kernel merging", then I feel I should probably write my own kernels to optimize. Is that the only way to achieve GPU optimization?

PyThagoras

1 Answer

  1. Generally speaking, cupy does not create a single kernel encompassing the behavior of separate program statements. There is no automatic fusion. cupy does have a `fuse` decorator, but it must be applied explicitly to a user-defined function (see below).

  2. See item 1

  3. Yes, you can create your own kernels. cupy provides a variety of methods for you to create user-defined kernels, and this is another possible method for combining multiple operations into a single underlying kernel call.

You should be able to further characterize the above statements/behavior with a GPU profiler (or with inspection, since cupy is open source).

Robert Crovella
  • That did answer most of my questions, thank you for the time. Are we usually right to say that user-defined kernels will be faster when multiple operations are incorporated into one, since there would be less data transfer between CPU and GPU? – PyThagoras Mar 14 '21 at 14:15
  • I think that might be a common observation. However, cupy understands that you don't want to move data unnecessarily between CPU and GPU, and a typical sequence of cupy statements won't do that. cupy is designed to intentionally keep the data on the GPU until a specific need for it arises on the CPU (e.g. via a `cp.asnumpy()` call). The benefit of kernel fusion ("multiple operations are incorporated into one") doesn't depend on eliminating unnecessary CPU/GPU traffic, but is described generically [here](https://stackoverflow.com/questions/53305830/): it saves memory bandwidth. – Robert Crovella Mar 14 '21 at 14:19