Alternatives
Here are some alternatives to both cuda.local.array() and individually passing in arguments via cuda.to_device():
1. Allocate a single concatenated vector/matrix (called e.g. local_args) that actually represents something like 15 variables. The downside is that you constantly have to slice into it and hope you don't accidentally use indices from a different "sub-variable", or break the ordering by adding new variables, changing sizes, etc. later (see the first sketch after this list).
2. Split the operations into sequentially called Numba/CUDA kernels, or a combination of Numba cuda.jit() kernels, CuPy cupy.fuse() calls, and/or other CUDA implementations. For example, if you have operations on a set of vectors that would otherwise be (expensively and redundantly) repeated many more times inside a pairwise distance matrix computation (e.g. 10,000² times instead of 10,000 times), consider performing those operations beforehand and passing the results in as arguments; this can be combined with 1. or 3. (see the second sketch after this list).
3. A handy alternative I came across is to define a custom NumPy dtype, though this may cause issues with the NVCC compiler (hopefully now permanently fixed). A GitHub issue has an example along these lines:
import numpy as np
from numba import cuda

np_int = np.int32
np_float = np.float32
cuda_const_arrays_type = np.dtype([
    ('a1', (np_int, (7776, 13))),
    ('a2', (np_int, (7776, 2, 5))),
    ('a3', (np_int, (16494592))),
    ('a4', (np_int, 13)),
    ('a5', (np_float, (22528, 64))),
    ('a6', (np_int, (522523, 64))),
    ('a7', (np_int, (32, 5))),
    ('a8', (np_int, (66667))),
    ('a9', (np_int, (252, 64, 3, 2, 2, 2, 2, 2, 2, 13))),
    ('a10', (np_int, (7776)))
])
cuda_const_arrays = np.zeros(1, dtype=cuda_const_arrays_type)
for txt in cuda_const_arrays_type.names:  # i.e. ("a1", "a2", ...)
    cuda_const_arrays[0][txt] = np.loadtxt(open(txt + ".csv", "rb"), delimiter=",", skiprows=1)
gpu_const_arrays = cuda.to_device(cuda_const_arrays[0])
@cuda.jit(device=True)
def cuda_doSomething(gpu_const_arrays, ...):
    gpu_const_arrays.a1
An example from the same user can be found on GitLab (it is fine to delete the import keras as ks line). While this caused sporadic errors with previous Numba versions, it worked fine for numba 0.53.1 and cudatoolkit 11.2.2, indicating that the "custom dtype" approach might be OK now.
To avoid unnecessarily passing large amounts of data to functions lower in the call stack, it may be appropriate to pass only a subset of the arguments in this custom dtype, but I'm not sure how to do this.
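For option 1, here is a minimal sketch of what that bookkeeping can look like. The sizes, offsets, and names (local_args, OFF_A, OFF_B, OFF_SCRATCH, combined_kernel) are all made up for illustration:

import numpy as np
from numba import cuda, float32

# Hand-maintained offsets into the single concatenated local array --
# exactly the fragile bookkeeping described above.
OFF_A, OFF_B, OFF_SCRATCH, TOTAL = 0, 3, 6, 9

@cuda.jit
def combined_kernel(out):
    i = cuda.grid(1)
    if i < out.size:
        # One allocation standing in for three separate local "variables".
        local_args = cuda.local.array(TOTAL, float32)
        for j in range(3):
            local_args[OFF_A + j] = i + j      # "vector a"
            local_args[OFF_B + j] = 2.0 * j    # "vector b"
        acc = 0.0
        for j in range(3):
            local_args[OFF_SCRATCH + j] = local_args[OFF_A + j] * local_args[OFF_B + j]
            acc += local_args[OFF_SCRATCH + j]
        out[i] = acc

out = np.zeros(16, dtype=np.float32)
combined_kernel[1, 32](out)

Adding a new "sub-variable" or changing a size means updating every offset by hand, which is where the accidental index mix-ups come from.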
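For option 2, here is a minimal sketch of the "precompute, then pass in" idea, using squared Euclidean distances as the example. The kernel names (row_sq_norms, pairwise_sq_dists), array sizes, and launch configuration are illustrative assumptions, not anything from the original code:

import numpy as np
from numba import cuda

@cuda.jit
def row_sq_norms(X, norms):
    # First kernel: one pass over the N rows.
    i = cuda.grid(1)
    if i < X.shape[0]:
        s = 0.0
        for k in range(X.shape[1]):
            s += X[i, k] * X[i, k]
        norms[i] = s

@cuda.jit
def pairwise_sq_dists(X, norms, D):
    # Second kernel: reuse the precomputed norms instead of recomputing
    # them inside each of the N**2 pairwise iterations.
    i, j = cuda.grid(2)
    if i < X.shape[0] and j < X.shape[0]:
        dot = 0.0
        for k in range(X.shape[1]):
            dot += X[i, k] * X[j, k]
        D[i, j] = norms[i] + norms[j] - 2.0 * dot

X = np.random.rand(1000, 8).astype(np.float32)
d_X = cuda.to_device(X)
d_norms = cuda.device_array(X.shape[0], dtype=np.float32)
d_D = cuda.device_array((X.shape[0], X.shape[0]), dtype=np.float32)

row_sq_norms[(X.shape[0] + 127) // 128, 128](d_X, d_norms)
threads = (16, 16)
blocks = ((X.shape[0] + 15) // 16, (X.shape[0] + 15) // 16)
pairwise_sq_dists[blocks, threads](d_X, d_norms, d_D)
D = d_D.copy_to_host()

The first kernel could just as easily be a cupy.fuse() call or plain CuPy array math; the point is that the per-row quantity is computed N times total and then passed in, rather than N² times inside the pairwise kernel.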
Other Generally Useful Examples
While we're waiting on CuPy or NumPy support in Numba/CUDA [7][9][10][11], the following are examples I've found relevant/useful in the workflow of writing Numba/CUDA scripts.
Some of these examples are really nice because you can see both the original, inefficient approach and how it was modified to become much more efficient (similar to the Numba Docs: CUDA: Matrix Multiplication example), and you can see how others approached array allocation and argument passing in Numba/CUDA.