TL;DR
Initialization and assignment with cuda.local.array() is ~3x faster than to_device() transfers; however, in a simple real-world example using QuickSort, it is ~2x slower. I'm not sure why there's a ~6x difference between these two benchmarking cases. For the sorting example and for reference, NumPy is about the same speed as cuda.local.array(), and CuPy is ~4x faster than cuda.local.array() (and ~2x faster than to_device()). I imagine the CuPy advantage is because I didn't implement shared memory.
Benchmark Code
The first benchmark is a dummy example where vectors (via either cuda.local.array() or to_device()) are copied to out. The second benchmark is a simple real-world example using row-wise QuickSort on 1,000,000 rows and 100 columns.
Initialization and assignment
The following is a simple test of the two cases: cuda.local.array(), where values are initialized to zero locally on the device, and to_device(), where values are initialized to zero in the host code. The values are then assigned to out.
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 21 11:39:58 2021
@author: sterg
"""
import os
# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"
from numba import cuda
from numba.types import int32, float32
import numpy as np
from time import time
bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32
rows = 1000000
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
cols = 100
shape = (rows, cols)
test_vals = np.zeros(shape)
print(rows)
@cuda.jit("void(float{0}[:,:])".format(bits))
def local_array(out):
# CUDA iterator
idx = cuda.grid(1)
nrows = out.shape[0]
if idx < nrows:
# allocate local arrays
vec = cuda.local.array(cols, nb_float)
# foo stuff for MWE
for i in range(cols):
vec[i] = 0.0
out[idx, i] = vec[i]
@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def device_array(mat, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
# foo stuff for MWE
for i in range(cols):
out[idx, i] = vec[i]
def local_array_driver(nrows):
# block, grid
block_dim = 5
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_out = cuda.device_array((nrows, cols), np_float)
local_array[grid_dim, block_dim](cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
def device_array_driver(mat):
# block, grid
block_dim = 5
nrows = mat.shape[0]
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_out = cuda.device_array((nrows, cols), np_float)
cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
device_array[grid_dim, block_dim](cuda_mat, cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(100)
device_array_driver(test_vals[0:100])
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(rows)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))
Sorting Example
The following benchmark tests the following two cases and performs a simple routine (quicksort): allocate 3 arrays locally via cuda.local.array()
or pass preallocated arrays via cuda.to_device()
.
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 20 16:19:34 2021
@author: sterg
"""
import os
# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"
from numba import njit, cuda
from numba.types import int32, float32
import numpy as np
from time import time
from warnings import warn
from pdb import set_trace
bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32
rows = 2000000
cols = 100
test_vals = np.random.rand(rows, cols)
print(rows)
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
verbose = True
@cuda.jit(device=True)
def partition(arr, ids, l, h):
    """
    Partition using a pivot.

    The function takes the last element as the pivot, places the pivot element
    at its correct position in the sorted array, and places all smaller elements
    (smaller than the pivot) to the left of the pivot and all greater elements
    to the right of the pivot.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).
    l : int
        Starting index for sorting.
    h : int
        Ending index for sorting.

    Returns
    -------
    int
        Index of the pivot element after partitioning.
    """
    # index of smaller element
    i = l - 1
    pivot = arr[h]
    for j in range(l, h):
        # If current element is smaller than or equal to pivot
        if arr[j] <= pivot:
            # increment index of smaller element and swap
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
            ids[i], ids[j] = ids[j], ids[i]
    # move the pivot into place
    arr[i + 1], arr[h] = arr[h], arr[i + 1]
    ids[i + 1], ids[h] = ids[h], ids[i + 1]
    return i + 1
@cuda.jit(device=True)
def quickSortIterative(arr, stack, ids):
    """
    Perform iterative quicksort on an array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array used as the iteration stack (zeroed inside this function).
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr))
        (filled inside this function).

    Returns
    -------
    None.
    """
    # low and high indices
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # fill ids with range(len(arr)) and zero the stack
    for i in range(len(arr)):
        ids[i] = i
        stack[i] = 0

    # push initial values of l and h to the stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from the stack while it is not empty
    while top >= 0:
        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set the pivot element at its correct position in the sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on the left side of the pivot,
        # push the left side onto the stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on the right side of the pivot,
        # push the right side onto the stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h
@cuda.jit(device=True)
def quickSortIterativeDevice(arr, stack, ids):
    """
    Perform iterative quicksort on an array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array initialized with 0's.
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).

    Returns
    -------
    None.
    """
    # low and high indices
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # push initial values of l and h to the stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from the stack while it is not empty
    while top >= 0:
        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set the pivot element at its correct position in the sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on the left side of the pivot,
        # push the left side onto the stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on the right side of the pivot,
        # push the right side onto the stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h
# Controls threads per block and shared memory usage
# (only used by the commented-out shared-memory kernel below).
# The computation would be done on blocks of TPBx x TPBy elements.
TPBx = 16
TPBy = cols
@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def local_array_sort(mat, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
vals = cuda.local.array(cols, nb_float)
sorter = cuda.local.array(cols, nb_int)
stack = cuda.local.array(cols, nb_int)
# fill vals with data
for i in range(cols):
vals[i] = vec[i]
# sorting
quickSortIterative(vals, stack, sorter)
# foo stuff using vals and sorter, in this case just outputting the values (MWE)
for i in range(cols):
out[idx, i] = vals[i]
# # for debugging
# x = cuda.threadIdx.x
# bx = cuda.blockIdx.x
# if x == 0 and bx == 0:
# set_trace()
@cuda.jit("void(float{0}[:,:], int{0}[:,:], int{0}[:,:], float{0}[:,:])".format(bits))
def device_array_sort(mat, stack, sorter, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
substack = stack[idx]
subsorter = sorter[idx]
# sorting (substack is already initialized to 0's, subsorter to list(range(cols)))
quickSortIterativeDevice(vec, substack, subsorter)
# foo stuff using vals and sorter, in this case just outputting the values (MWE)
for i in range(cols):
out[idx, i] = vec[i]
# # for debugging
# x = cuda.threadIdx.x
# bx = cuda.blockIdx.x
# if x == 0 and bx == 0:
# set_trace()
def local_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)
    local_array_sort[grid_dim, block_dim](cuda_mat, cuda_out)
    out = cuda_out.copy_to_host(stream=stream)
    return out


def device_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)
    stack = np.zeros((nrows, cols))
    sorter = np.tile(list(range(cols)), (nrows, 1))
    cuda_stack = cuda.to_device(np.asarray(stack, dtype=np_int), stream=stream)
    cuda_sorter = cuda.to_device(np.asarray(sorter, dtype=np_int), stream=stream)
    device_array_sort[grid_dim, block_dim](cuda_mat, cuda_stack, cuda_sorter, cuda_out)
    out = cuda_out.copy_to_host(stream=stream)
    return out
# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(test_vals[0:100])
device_array_driver(test_vals[0:100])
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(test_vals)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))
# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    raise ValueError("out is not sorted")
elif verbose:
    print("out is sorted")
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))
# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    warn("out is not sorted")
elif verbose:
    print("out is sorted")
start = time()
out = np.sort(test_vals)
end = time()
print("Elapsed (NumPy) = %s" % (end - start))
# # not fully implemented/tested
# @cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
# def local_array_shared_sort(mat, out):
#     # Define an array in the shared memory
#     # The size and type of the arrays must be known at compile time
#     smat = cuda.shared.array(shape=(TPBx, TPBy), dtype=float32)
#     x, y = cuda.grid(2)
#     tx = cuda.threadIdx.x
#     ty = cuda.threadIdx.y
#     bpg = cuda.gridDim.x  # blocks per grid
#     nrows = mat.shape[0]
#     if x >= nrows and y >= cols:
#         # Quit if (x, y) is outside of valid C boundary
#         return
#     # allocate local arrays
#     vals = cuda.local.array(cols, nb_float)
#     sorter = cuda.local.array(cols, nb_int)
#     stack = cuda.local.array(cols, nb_int)
#     for i in range(bpg):
#         # Preload data into shared memory
#         smat[tx, ty] = mat[x, ty + i * TPBx]
#         # Wait until all threads finish preloading
#         cuda.syncthreads()
#         # fill vals with some data (just for MWE)
#         for i in range(cols):
#             vals[i] = smat[i]
#         # sorting
#         quickSortIterative(vals, stack, sorter)
#         # foo stuff using vals and sorter, in this case just outputting the values (MWE)
#         for i in range(cols):
#             out[x, i] = vals[i]
#         # Wait until all threads finish computing (is this one unnecessary?)
#         cuda.syncthreads()
Benchmark Results
Initialization and assignment
For this dummy example, it is ~3x faster to use local arrays.
1000000
Elapsed (local: after compilation) = 0.14660310745239258
Elapsed (device: after compilation) = 0.4358341693878174
Sorting
However, implementing QuickSort with 1,000,000 rows and 100 columns gives:
1000000
Elapsed (local: after compilation) = 2.4355175495147705
Elapsed (device: after compilation) = 1.1668529510498047
out is sorted
Elapsed (NumPy) = 2.304871082305908
and with 2,000,000 rows:
2000000
Elapsed (local: after compilation) = 4.839058876037598
Elapsed (device: after compilation) = 2.2948694229125977
out is sorted
Elapsed (NumPy) = 4.541851282119751
Note that the memory requirements are about 2-3x higher with device_array_sort: roughly speaking, for 2,000,000 rows and 100 columns, mat and out alone take ~1.6 GB of float32 on the device, and the preallocated stack and sorter add another ~1.6 GB of int32, plus the host-side staging arrays. So it looks like my use of 3 local arrays in this sorting of one or two million 100-element vectors causes a ~2x slowdown. In other words, preallocating the arrays and sending them to the device is uglier (more kernel arguments) and less memory efficient, but overall faster.
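One way to dig into why the local-array version is slower: cuda.local.array allocations live in per-thread "local" memory, which physically resides in off-chip device memory (cached in L1/L2), so three 100-element scratch arrays per thread generate extra memory traffic. A hedged sketch for checking this, assuming a Numba version whose CUDA kernels expose inspect_asm() (recent releases do):

# dump the generated PTX and show lines that touch the .local state space;
# heavy ld.local/st.local traffic points at the per-thread scratch arrays
ptx = local_array_sort.inspect_asm()
# inspect_asm() returns a dict keyed by signature on recent Numba versions;
# normalize in case an older version returns a single string
if not isinstance(ptx, dict):
    ptx = {"<compiled>": ptx}
for sig, code in ptx.items():
    local_lines = [ln for ln in code.splitlines() if ".local" in ln]
    print(sig)
    print("\n".join(local_lines[:10]))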
CuPy comparison
import cupy as cp
from time import time
rows = 1000000
cols = 100
test_vals = cp.random.rand(rows, cols)
t = time()
cp.sort(test_vals)
print("Elapsed (CuPy) = %s" % (time() - t))
Elapsed (CuPy) = 0.5325748920440674
It ran out of memory for the 2,000,000-row case.
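Two hedged notes on this comparison: cp.sort sorts along the last axis by default, so it is row-wise like the Numba kernels, and CuPy launches are asynchronous, so synchronizing the device before reading the clock gives a more defensible number. A minimal variant of the timing above:

import cupy as cp
from time import time

rows, cols = 1000000, 100
test_vals = cp.random.rand(rows, cols)

cp.cuda.Device().synchronize()     # make sure the random generation has finished
t = time()
out = cp.sort(test_vals, axis=-1)  # axis=-1 is the default: sort each row
cp.cuda.Device().synchronize()     # wait for the sort kernel before stopping the timer
print("Elapsed (CuPy, synchronized) = %s" % (time() - t))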
Note on Shared Memory
Shared memory was not used in this example, but incorporating it should result in further speed gains (probably comparable to CuPy implementation). See Numba Docs: CUDA: Examples: Matrix Multiplication for an example of implementing shared memory. As of 2021-08-21, this has some bugs which are described and fixed in How to generalize fast matrix multiplication on GPU using numba.