
I am attempting to optimize the performance of a forward kinematics routine in robotics. In essence, this function accepts 6 joint values as input and produces a 4x4 array representing the position and orientation of the robot's end effector. Each element of the array can be computed independently and involves a large combination of multiplications, sines, and cosines of the 6 input values. Generalizing to N configurations, each with 6 joints, the core function receives an (N, 6) input array and returns an (N, 4, 4) array. Working examples are provided at the end of this post.

My goal is to compare the performance of these different implementations and find the speed limit when accelerating the forward kinematics calculation with Numba. I have also included execution times for a 2_000_000 x 6 input array:

  • @numba.guvectorize (with target=cpu/parallel/cuda)
    • cpu: 92ms
    • parallel: 21ms
    • cuda: 200ms (only processing time)
  • @cuda.jit: 160ms (only processing time)

I am seeking suggestions to improve the performance of my CUDA implementation for the forward kinematics calculation. About the CUDA-based approaches:

  • @guvectorize with target="cuda": I believe this implementation is already close to optimal, as it is written over the smallest input dimensions, which maximizes the number of parallel workers for CUDA. I don't think the @guvectorize decorator offers much flexibility for further improvements.

  • @cuda.jit: I suspect there is room for improvement in this implementation, as I am relatively new to CUDA programming. Here are a few ideas I have considered:

    • [IMPLEMENTED ONE] Launching N blocks with 16 threads each and using conditional statements to assign each thread to compute a specific value in the 4x4 output array. However, I am aware that this kind of branch divergence goes against the CUDA paradigm and may not be efficient.

    • Launching separate kernels to compute each value in the 4x4 array. This would remove all if statements. However, this approach may introduce significant overhead due to the large number of kernel launches.

Therefore:

  • Is this type of computation suitable for a GPU at all? I have my doubts: although it is computationally intensive, each element of the 4x4 output array requires executing completely different instructions.
  • For the @guvectorize implementation with target="cuda", is there any way to make it faster?
  • For the @cuda.jit implementation, how would people with more CUDA experience typically approach this type of computation?

Working code (requires Numba and a working CUDA installation):

from time import perf_counter

import numpy as np
from numba import bool_, cuda, float64, guvectorize


@guvectorize(
    [(float64[:], bool_[:, :], float64[:, :])],
    "(n), (m,m) -> (m,m)",
    nopython=True,
    target="cuda",
)
def guvectorized(joints, dummy, res):
    q1, q2, q3, q4, q5, q6 = joints
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)

    res[0,0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
        s1 * c4 - c1 * c23 * s4
    )
    res[1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
        c1 * c4 - s1 * c23 * s4
    )
    res[2,0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    res[0,1] = s6 * (
        c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
    )
    res[1,1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
        c1 * c4 + s1 * c23 * s4
    )
    res[2,1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    res[0,2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    res[1,2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    res[2,2] = s23 * c4 * c5 - c23 * c5
    res[0,3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[1,3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[2,3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    res[3, 0] = 0.0
    res[3, 1] = 0.0
    res[3, 2] = 0.0
    res[3, 3] = 1.0

@cuda.jit
def cuda_jit(joints, res):
    i = cuda.threadIdx.x
    block_id = cuda.blockIdx.x
    q1, q2, q3, q4, q5, q6 = joints[block_id]
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)


    if i == 0:
        res[block_id, 0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
            s1 * c4 - c1 * c23 * s4
        )
    elif i == 1:
        res[block_id, 1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
            c1 * c4 - s1 * c23 * s4
        )
    elif i == 2:
        res[block_id, 2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    elif i == 3:
        res[block_id, 0, 1] = s6 * (
            c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
        )
    elif i == 4:
        res[block_id, 1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
            c1 * c4 + s1 * c23 * s4
        )
    elif i == 5:
        res[block_id, 2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    elif i == 6:
        res[block_id, 0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    elif i == 7:
        res[block_id, 1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    elif i == 8:
        res[block_id, 2, 2] = s23 * c4 * c5 - c23 * c5
    elif i == 9:
        res[block_id, 0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 10:
        res[block_id, 1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 11:
        res[block_id, 2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    elif i == 12:
        res[block_id, 3, 0] = float64(0.0)
        res[block_id, 3, 1] = float64(0.0)
        res[block_id, 3, 2] = float64(0.0)
        res[block_id, 3, 3] = float64(1.0)


if __name__ == "__main__":
    # Setup
    N = 1_000_000
    # Guvectorized arrays
    input_arr = np.random.rand(N, 6)
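    # Dummy (4, 4) boolean input used only to give the gufunc signature its (m, m) output shape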
    dummy_arr = np.empty((4, 4), dtype=np.bool_)
    out = guvectorized(input_arr, dummy_arr)

    # Cuda.jit arrays
    input_arr_cu = cuda.to_device(input_arr)
    output_arr_cu = cuda.device_array((N, 4, 4))
    threads_per_block = 13
    blocks_per_grid = N
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)

    # Timing
    #   Guvectorize
    init = perf_counter()
    guvectorized(input_arr, dummy_arr)
    print(f"Time `guvectorize`: {perf_counter() - init}")

    #   Cuda jit
    init = perf_counter()
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)
    cuda.synchronize()
    print(f"Time `cuda.jit`: {perf_counter() - init}")
Manu
  • I'm sorry, but what are you hoping to achieve by running 13 threads per block, with each thread being completely divergent from the others in the block? – talonmies Jun 15 '23 at 09:25
  • Thanks for taking the time to read it! That's related to my first question. I already said that my current implementation is horrible from the point of view of CUDA; it was just an experiment, as I barely had any experience with it. That's why I was looking for a more professional perspective on how to approach this type of problem. Maybe it is not even suitable for a GPU. – Manu Jun 15 '23 at 10:58
  • The finest granularity of parallelism in CUDA is the warp. Each 32 consecutive threads of a block come from the same warp. A warp is similar to a SIMD register, i.e. everything works in lockstep and divergence causes sequentialization. So one possibility would be to have e.g. the first warp of the first block work only on `[0,0]` elements of the output, the second warp on `[1,1]` and so on. To still profit from coalesced memory access, you will probably need to load input cooperatively through shared memory. – paleonix Jun 15 '23 at 16:30
  • If your real-world use case really has about a million values along the outer dimension, don't overthink it; just let each thread compute all 4x4 values. Coalescing the output memory access might need a fair bit of shared memory, but it's probably still okay given that the computational density is pretty high and you probably don't need perfect occupancy. – Homer512 Jun 15 '23 at 17:56
  • BTW: You are aware that consumer grade GPUs have rather [low double precision performance](https://www.thefpsreview.com/2020/09/21/geforce-rtx-3080-fe-gpgpu-compute-workstation-performance/3/), right? If you can, switch to single precision. – Homer512 Jun 15 '23 at 17:59
  • These answers were exactly what I was looking for. As all the examples I had seen were so "uniform" (summing all elements of an array, multiplications...), I did not have a clear picture of how to think about this type of problem. I will continue studying, then. Thanks for all the feedback! – Manu Jun 16 '23 at 20:02
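
Following up on the single-precision suggestion in the comments above, a minimal host-side sketch of that change (illustrative only; the float64(...) constants inside the kernels would also need to become float32, otherwise the arithmetic itself still runs in double precision). Numba will compile a new specialization of the kernel for the float32 argument types:

input_arr_f32 = input_arr.astype(np.float32)
input_arr_cu = cuda.to_device(input_arr_f32)
output_arr_cu = cuda.device_array((N, 4, 4), dtype=np.float32)
cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)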

1 Answer


In general, the CUDA programming paradigm likes every thread to do the same thing, with lots of floating point operations and a high computation to memory bandwidth ratio. If you have that, then your use case stands a good chance of seeing some benefit from using a GPU. Your use case doesn’t look like an intrinsically poor fit, using those points of reference. Your code, however, is another story.

I would start with something like this:

@cuda.jit
def cuda_jit(joints, res):
    Tid = cuda.grid(1)  # one thread per configuration
    if Tid >= res.shape[0]:  # guard in case the grid is rounded up past N
        return

    q1, q2, q3, q4, q5, q6 = joints[Tid]
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)

    res[Tid, 0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
        s1 * c4 - c1 * c23 * s4
    )
    res[Tid, 1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
        c1 * c4 - s1 * c23 * s4
    )
    res[Tid, 2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    res[Tid, 0, 1] = s6 * (
        c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
    )
    res[Tid, 1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
        c1 * c4 + s1 * c23 * s4
    )
    res[Tid, 2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    res[Tid, 0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    res[Tid, 1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    res[Tid, 2, 2] = s23 * c4 * c5 - c23 * c5
    res[Tid, 0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[Tid, 1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[Tid, 2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    res[Tid, 3, 0] = float64(0.0)
    res[Tid, 3, 1] = float64(0.0)
    res[Tid, 3, 2] = float64(0.0)
    res[Tid, 3, 3] = float64(1.0)

[code written on a phone, no guarantees implied, use at own risk]

I.e. just have each thread create all the entries of one result matrix. This will get rid of all the branch divergence and let you achieve much higher utilisation of the GPU hardware if you select a sensible block size (basically round multiples of 32, see here for more details).
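
For illustration, a possible launch configuration for the kernel above (the block size is a tunable assumption; any multiple of 32 up to the hardware limit is worth benchmarking). The grid is rounded up, which is why the kernel guards against out-of-range thread indices:

threads_per_block = 128  # multiple of 32; tune by benchmarking
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block  # round up
cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)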

Is it optimal? No, because of memory coalescing issues. The next levels of optimisation to fix that are hard (and the Numba compiler is very feature poor compared to the NVidia C++ compiler), and I suspect you don’t want hard at this point. But that should get you much better performance than you are getting now.
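
One comparatively cheap mitigation, assuming the code consuming the result can accept a transposed (4, 4, N) layout, is to make consecutive threads write to consecutive addresses. The sketch below only illustrates the indexing pattern (the kernel name and the placeholder value are made up; the real FK expressions would replace the placeholder):

@cuda.jit
def fk_transposed(joints, res_t):
    # res_t has shape (4, 4, N): for a fixed (row, col), neighbouring threads
    # write neighbouring elements, so stores within a warp are coalesced.
    tid = cuda.grid(1)
    if tid >= joints.shape[0]:
        return
    value = joints[tid, 0]  # placeholder for the per-entry FK expressions
    for row in range(4):
        for col in range(4):
            res_t[row, col, tid] = value

res_t_cu = cuda.device_array((4, 4, N))
fk_transposed[blocks_per_grid, threads_per_block](input_arr_cu, res_t_cu)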

talonmies
  • Isn't that what the `@guvectorize` version by OP is doing already? Do you expect better performance doing the same with `@cuda.jit`? – paleonix Jun 18 '23 at 09:05
  • In theory, yes, but I would be surprised if the Numba autovectorization can produce code which runs as well as a hand-written kernel doing the same thing. I haven't looked at it for a few years, but it used to be pretty poor (not a criticism; it's a minor miracle that it works as well as it does). – talonmies Jun 19 '23 at 04:22