Alternatives
Here are some alternatives to both cuda.local.array() and individually passing in arguments via cuda.to_device():
1. Allocate a single concatenated vector/matrix (called e.g. local_args) that actually represents something like 15 variables. The downside is that you constantly have to slice into it and hope you don't accidentally use indices from a different "sub-variable", or break the ordering by adding new variables, changing sizes, etc. later (see the first sketch after this list).
2. Split the operations into sequentially called Numba/CUDA kernels, or a combination of Numba cuda.jit() kernels, CuPy cupy.fuse() calls, and/or other CUDA implementations. For example, if you have operations on a set of vectors that would otherwise be (expensively and redundantly) repeated many more times inside a pairwise distance matrix computation (e.g. 10,000² times instead of 10,000 times), consider performing those operations beforehand and passing the results in as arguments; this can be combined with 1. or 3. (see the second sketch after this list).
3. A handy alternative I came across is to define a custom NumPy dtype, though this may cause issues with the NVCC compiler (hopefully now permanently fixed). A GitHub issue has an example along these lines:
import numpy as np
from numba import cuda

np_int = np.int32
np_float = np.float32
cuda_const_arrays_type = np.dtype([
    ('a1', (np_int, (7776, 13))),
    ('a2', (np_int, (7776, 2, 5))),
    ('a3', (np_int, (16494592))),
    ('a4', (np_int, 13)),
    ('a5', (np_float, (22528, 64))),
    ('a6', (np_int, (522523, 64))),
    ('a7', (np_int, (32, 5))),
    ('a8', (np_int, (66667))),
    ('a9', (np_int, (252, 64, 3, 2, 2, 2, 2, 2, 2, 13))),
    ('a10', (np_int, (7776)))
])
cuda_const_arrays = np.zeros(1, dtype=cuda_const_arrays_type)
for txt in cuda_const_arrays_type.names:  # i.e. ("a1", "a2", ...)
    cuda_const_arrays[0][txt] = np.loadtxt(open(txt + ".csv", "rb"), delimiter=",", skiprows=1)
gpu_const_arrays = cuda.to_device(cuda_const_arrays[0])
@cuda.jit(device=True)
def cuda_doSomething(gpu_const_arrays, ...):
    gpu_const_arrays.a1
An example from the same user can be found on GitLab (it is fine to delete the import keras as ks line). While this caused sporadic errors with previous Numba versions, it worked fine for numba 0.53.1 and cudatoolkit 11.2.2, indicating that the "custom dtype" approach might be OK now.
To avoid unnecessarily passing large amounts of data to functions lower in the call stack, it may be appropriate to pass only a subset of the arguments in this custom dtype, but I'm not sure how to do this.
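For option 1, here is a minimal sketch of what that bookkeeping can look like. The sizes, offsets, and names (local_args, OFF_A, OFF_B, OFF_SCRATCH, combined_kernel) are all made up for illustration:

import numpy as np
from numba import cuda, float32

# Hand-maintained offsets into the single concatenated local array --
# exactly the fragile bookkeeping described above.
OFF_A, OFF_B, OFF_SCRATCH, TOTAL = 0, 3, 6, 9

@cuda.jit
def combined_kernel(out):
    i = cuda.grid(1)
    if i < out.size:
        # One allocation standing in for three separate local "variables".
        local_args = cuda.local.array(TOTAL, float32)
        for j in range(3):
            local_args[OFF_A + j] = i + j      # "vector a"
            local_args[OFF_B + j] = 2.0 * j    # "vector b"
        acc = 0.0
        for j in range(3):
            local_args[OFF_SCRATCH + j] = local_args[OFF_A + j] * local_args[OFF_B + j]
            acc += local_args[OFF_SCRATCH + j]
        out[i] = acc

out = np.zeros(16, dtype=np.float32)
combined_kernel[1, 32](out)

Adding a new "sub-variable" or changing a size means updating every offset by hand, which is where the accidental index mix-ups come from.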
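For option 2, here is a minimal sketch of the "precompute, then pass in" idea, using squared Euclidean distances as the example. The kernel names (row_sq_norms, pairwise_sq_dists), array sizes, and launch configuration are illustrative assumptions, not anything from the original code:

import numpy as np
from numba import cuda

@cuda.jit
def row_sq_norms(X, norms):
    # First kernel: one pass over the N rows.
    i = cuda.grid(1)
    if i < X.shape[0]:
        s = 0.0
        for k in range(X.shape[1]):
            s += X[i, k] * X[i, k]
        norms[i] = s

@cuda.jit
def pairwise_sq_dists(X, norms, D):
    # Second kernel: reuse the precomputed norms instead of recomputing
    # them inside each of the N**2 pairwise iterations.
    i, j = cuda.grid(2)
    if i < X.shape[0] and j < X.shape[0]:
        dot = 0.0
        for k in range(X.shape[1]):
            dot += X[i, k] * X[j, k]
        D[i, j] = norms[i] + norms[j] - 2.0 * dot

X = np.random.rand(1000, 8).astype(np.float32)
d_X = cuda.to_device(X)
d_norms = cuda.device_array(X.shape[0], dtype=np.float32)
d_D = cuda.device_array((X.shape[0], X.shape[0]), dtype=np.float32)

row_sq_norms[(X.shape[0] + 127) // 128, 128](d_X, d_norms)
threads = (16, 16)
blocks = ((X.shape[0] + 15) // 16, (X.shape[0] + 15) // 16)
pairwise_sq_dists[blocks, threads](d_X, d_norms, d_D)
D = d_D.copy_to_host()

The first kernel could just as easily be a cupy.fuse() call or plain CuPy array math; the point is that the per-row quantity is computed N times total and then passed in, rather than N² times inside the pairwise kernel.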
Other Generally Useful Examples
While we're waiting on CuPy or NumPy support in Numba/CUDA [7][9][10][11], the following are examples I've found relevant/useful in the workflow of writing Numba/CUDA scripts.
Some of these examples are really nice because you can see both the original, inefficient approach and how it was modified to become much more efficient (similar to the Numba Docs: CUDA: Matrix Multiplication example), and you can see how others approached array allocation and argument passing in Numba/CUDA.