
Under Numba Docs: Memory Management, for numba.cuda.local.array() it states:

Local Memory

Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.

numba.cuda.local.array(shape, type)

Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).

Note that the Numba docs for this may need to be updated regarding use of the term "allocate" (see Numba PR and comment below)
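
For context, a minimal kernel using cuda.local.array() might look like the following (an illustrative sketch; the kernel name and sizes are my own):

from numba import cuda
from numba.types import float32

import numpy as np

N = 8  # local array shape must be a compile-time constant

@cuda.jit
def scratch_kernel(out):
    i = cuda.grid(1)
    if i < out.shape[0]:
        # private per-thread scratch array
        scratch = cuda.local.array(N, float32)
        for j in range(N):
            scratch[j] = j
        out[i] = scratch[i % N]

out = cuda.device_array(32, dtype=np.float32)
scratch_kernel[1, 32](out)
print(out.copy_to_host())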

Under to_device(), it states:

numba.cuda.to_device(obj, stream=0, copy=True, to=None)

Allocate and transfer a numpy ndarray or structured scalar to the device.

To copy host->device a numpy array:

ary = np.arange(10)

d_ary = cuda.to_device(ary)

How does the use of one or the other affect performance (runtime and memory)?

Sterling
  • Local arrays are never "allocated". CUDA local memory (which is what this maps to in Numba) is always statically defined by the compiler. There is no allocation overhead when you run a kernel; memory is taken from a statically reserved pool of local DRAM set up when the GPU function is loaded into the context, based on what the compiler reserved in the compiled code and the maximum number of concurrent threads for the target GPU. They have zero allocation cost at kernel runtime. – talonmies Aug 21 '21 at 07:54
  • @talonmies See my edits. – Sterling Aug 22 '21 at 01:36

1 Answer


TL;DR

In the dummy initialization-and-assignment benchmark, cuda.local.array() is ~3x faster than to_device() transfers; however, in a simple real-world example using QuickSort, it is ~2x slower. I'm not sure why there is a ~6x swing between the two benchmarks.

For the sorting example and for reference, NumPy is about the same as cuda.local.array(), and CuPy is ~2x faster than to_device() (and ~4x faster than the cuda.local.array() version). I imagine the CuPy advantage is because I didn't implement shared memory.

Benchmark Code

The first benchmark is a dummy example in which vectors (via either cuda.local.array() or to_device()) are copied to out. The second benchmark is a simple real-world example using row-wise QuickSort on 1,000,000 rows and 100 columns.

Initialization and assignment

The following is a simple test of the two cases: cuda.local.array(), where values are initialized to zero locally on device, and to_device(), where values are initialized to zero in the host code. The values are then assigned to out.

# -*- coding: utf-8 -*-
"""
Created on Sat Aug 21 11:39:58 2021

@author: sterg
"""

import os

# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"

from numba import cuda
from numba.types import int32, float32

import numpy as np

from time import time

bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32

rows = 1000000
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
cols = 100

shape = (rows, cols)
test_vals = np.zeros(shape)
print(rows)


@cuda.jit("void(float{0}[:,:])".format(bits))
def local_array(out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = out.shape[0]

    if idx < nrows:
        # allocate local arrays
        vec = cuda.local.array(cols, nb_float)

        # foo stuff for MWE
        for i in range(cols):
            vec[i] = 0.0
            out[idx, i] = vec[i]


@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def device_array(mat, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]

        # foo stuff for MWE
        for i in range(cols):
            out[idx, i] = vec[i]


def local_array_driver(nrows):
    # block, grid
    block_dim = 5
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_out = cuda.device_array((nrows, cols), np_float)
    local_array[grid_dim, block_dim](cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


def device_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_out = cuda.device_array((nrows, cols), np_float)
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    device_array[grid_dim, block_dim](cuda_mat, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(100)
device_array_driver(test_vals[0:100])

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(rows)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))

Sorting Example

The following benchmark compares two approaches to a simple routine (QuickSort): allocate 3 working arrays locally via cuda.local.array(), or pass preallocated arrays via cuda.to_device().

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 20 16:19:34 2021

@author: sterg
"""

import os

# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"

from numba import njit, cuda
from numba.types import int32, float32

import numpy as np

from time import time
from warnings import warn
from pdb import set_trace

bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32

rows = 2000000
cols = 100
test_vals = np.random.rand(rows, cols)
print(rows)
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314

verbose = True


@cuda.jit(device=True)
def partition(arr, ids, l, h):
    """
    Partition using pivot.

    Function takes last element as pivot, places the pivot element at its correct
    position in sorted array, and places all smaller (smaller than pivot) to left of
    pivot and all greater elements to right of pivot

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).
    l : int
        Starting index for sorting.
    h : int
        Ending index for sorting.

    Returns
    -------
    int
        The final index of the pivot element after partitioning.

    """
    # index of smaller element
    i = l - 1

    pivot = arr[h]

    for j in range(l, h):

        # If current element is smaller than or equal to pivot
        if arr[j] <= pivot:

            # increment index of smaller element
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
            ids[i], ids[j] = ids[j], ids[i]

    arr[i + 1], arr[h] = arr[h], arr[i + 1]
    ids[i + 1], ids[h] = ids[h], ids[i + 1]

    return i + 1


@cuda.jit(device=True)
def quickSortIterative(arr, stack, ids):
    """
    Perform iterative quicksort on array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array initialized with 0's
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).

    Returns
    -------
    None.

    """
    # low and high indices.
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # fill ids with range(len(arr))
    for i in range(len(arr)):
        ids[i] = i
        stack[i] = 0

    # push initial values of l and h to stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from stack while is not empty
    while top >= 0:

        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set pivot element at its correct position in
        # sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on left side of pivot,
        # then push left side to stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on right side of pivot,
        # then push right side to stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h


@cuda.jit(device=True)
def quickSortIterativeDevice(arr, stack, ids):
    """
    Perform iterative quicksort on array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array initialized with 0's
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).

    Returns
    -------
    None.

    """
    # low and high indices.
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # push initial values of l and h to stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from stack while is not empty
    while top >= 0:

        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set pivot element at its correct position in
        # sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on left side of pivot,
        # then push left side to stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on right side of pivot,
        # then push right side to stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h


# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPBx = 16
TPBy = cols


@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def local_array_sort(mat, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]
        vals = cuda.local.array(cols, nb_float)
        sorter = cuda.local.array(cols, nb_int)
        stack = cuda.local.array(cols, nb_int)

        # fill vals with data
        for i in range(cols):
            vals[i] = vec[i]

        # sorting
        quickSortIterative(vals, stack, sorter)

        # foo stuff using vals and sorter, in this case just outputting the values (MWE)
        for i in range(cols):
            out[idx, i] = vals[i]

        # # for debugging
        # x = cuda.threadIdx.x
        # bx = cuda.blockIdx.x
        # if x == 0 and bx == 0:
        #     set_trace()


@cuda.jit("void(float{0}[:,:], int{0}[:,:], int{0}[:,:], float{0}[:,:])".format(bits))
def device_array_sort(mat, stack, sorter, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]
        substack = stack[idx]
        subsorter = sorter[idx]

        # sorting (substack is already initialized to 0's, subsorter to list(range(cols)))
        quickSortIterativeDevice(vec, substack, subsorter)

        # foo stuff using vals and sorter, in this case just outputting the values (MWE)
        for i in range(cols):
            out[idx, i] = vec[i]

        # # for debugging
        # x = cuda.threadIdx.x
        # bx = cuda.blockIdx.x
        # if x == 0 and bx == 0:
        #     set_trace()


def local_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]

    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)

    local_array_sort[grid_dim, block_dim](cuda_mat, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


def device_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()

    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)

    stack = np.zeros((nrows, cols))
    sorter = np.tile(list(range(cols)), (nrows, 1))
    cuda_stack = cuda.to_device(np.asarray(stack, dtype=np_int), stream=stream)
    cuda_sorter = cuda.to_device(np.asarray(sorter, dtype=np_int), stream=stream)

    device_array_sort[grid_dim, block_dim](cuda_mat, cuda_stack, cuda_sorter, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(test_vals[0:100])
device_array_driver(test_vals[0:100])

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(test_vals)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))

# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    raise ValueError("out is not sorted")
elif verbose:
    print("out is sorted")

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))

# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    warn("out is not sorted")
elif verbose:
    print("out is sorted")

start = time()
out = np.sort(test_vals)
end = time()
print("Elapsed (NumPy) = %s" % (end - start))


# # not fully implemented/tested
# @cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
# def local_array_shared_sort(mat, out):

#     # Define an array in the shared memory
#     # The size and type of the arrays must be known at compile time
#     smat = cuda.shared.array(shape=(TPBx, TPBy), dtype=float32)

#     x, y = cuda.grid(2)

#     tx = cuda.threadIdx.x
#     ty = cuda.threadIdx.y
#     bpg = cuda.gridDim.x  # blocks per grid

#     nrows = mat.shape[0]
#     if x >= nrows and y >= cols:
#         # Quit if (x, y) is outside of valid C boundary
#         return

#     # allocate local arrays
#     vals = cuda.local.array(cols, nb_float)
#     sorter = cuda.local.array(cols, nb_int)
#     stack = cuda.local.array(cols, nb_int)

#     for i in range(bpg):
#         # Preload data into shared memory
#         smat[tx, ty] = mat[x, ty + i * TPBx]

#         # Wait until all threads finish preloading
#         cuda.syncthreads()

#         # fill vals with some data (just for MWE)
#         for i in range(cols):
#             vals[i] = smat[i]

#         # sorting
#         quickSortIterative(vals, stack, sorter)

#         # foo stuff using vals and sorter, in this case just outputting the values (MWE)
#         for i in range(cols):
#             out[x, i] = vals[i]

#         # Wait until all threads finish computing (is this one unnecessary?)
#         cuda.syncthreads()

Benchmark Results

Initialization and assignment

For this dummy example, it is ~3x faster to use local arrays.

1000000
Elapsed (local: after compilation) = 0.14660310745239258
Elapsed (device: after compilation) = 0.4358341693878174

Sorting

However, implementing QuickSort with 1,000,000 rows and 100 columns gives:

1000000
Elapsed (local: after compilation) = 2.4355175495147705
Elapsed (device: after compilation) = 1.1668529510498047
out is sorted
Elapsed (NumPy) = 2.304871082305908

and with 2,000,000 rows:

2000000
Elapsed (local: after compilation) = 4.839058876037598
Elapsed (device: after compilation) = 2.2948694229125977
out is sorted
Elapsed (NumPy) = 4.541851282119751

Note that the memory requirements are about 2-3x higher with device_array_sort. So my use of 3 local arrays when sorting one or two million 100-element vectors comes with a slowdown of ~2x. In other words, preallocating the arrays and sending them to the device is uglier (more kernel arguments) and less memory efficient, but overall faster.
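
As a rough back-of-the-envelope check on that memory figure (assuming everything on the device stays float32/int32 and ignoring the host-side intermediates):

rows, cols = 1_000_000, 100
itemsize = 4  # bytes per float32 / int32 element

# local-array version: only mat and out live in global device memory
local_bytes = 2 * rows * cols * itemsize

# preallocated version: mat and out plus the stack and sorter arrays
device_bytes = 4 * rows * cols * itemsize

print(local_bytes / 1e9, device_bytes / 1e9)  # ~0.8 GB vs ~1.6 GB

The host-side np.zeros and np.tile intermediates (float64/int64 by default, before casting to 32-bit) push the total closer to the observed 2-3x.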

CuPy comparison

import cupy as cp
from time import time

rows = 1000000
cols = 100
test_vals = cp.random.rand(rows, cols)

t = time()
cp.sort(test_vals)
print("Elapsed (CuPy) = %s" % (time() - t))

Elapsed (CuPy) = 0.5325748920440674

CuPy ran out of memory for the 2,000,000-row case.

Note on Shared Memory

Shared memory was not used in this example, but incorporating it should yield further speedups (probably comparable to the CuPy implementation). See Numba Docs: CUDA: Examples: Matrix Multiplication for an example of using shared memory. As of 2021-08-21, that example has some bugs, which are described and fixed in How to generalize fast matrix multiplication on GPU using numba.
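
For reference, the basic load-to-shared / syncthreads pattern from that matrix-multiplication example looks roughly like the following (a sketch only: it just stages a tile of mat through shared memory and copies it to out; it does not sort):

from numba import cuda
from numba.types import float32

TPB = 16  # threads per block in each dimension; the shared tile is TPB x TPB

@cuda.jit
def stage_through_shared(mat, out):
    # shared tile visible to every thread in the block
    tile = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    if x < mat.shape[0] and y < mat.shape[1]:
        # each thread loads one element into the tile
        tile[tx, ty] = mat[x, y]

    # wait until every thread in the block has finished loading
    cuda.syncthreads()

    if x < out.shape[0] and y < out.shape[1]:
        out[x, y] = tile[tx, ty]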

Sterling