I am attempting to optimize the performance of a Forward Kinematics function in the field of robotics. In essence, this function accepts 6 joint values as input and produces a 4x4 array that represents the position and orientation of the robot's end effector. Each element of the array can be computed independently and involves a large combination of multiplications, sines, and cosines applied to the 6 input values. To generalize further, considering N configurations of 6 joints each, the core function receives a (N, 6) input array and returns a (N, 4, 4) array. Working examples are provided at the end of this post.
My goal is to compare the performance of these different implementations and find the speed limit when accelerating the forward kinematics calculation with Numba. I also include execution times for a 2_000_000 x 6 input array:
- @numba.guvectorize (with target=cpu/parallel/cuda):
  - cpu: 92 ms
  - parallel: 21 ms
  - cuda: 200 ms (only processing time)
- @cuda.jit: 160 ms (only processing time)
I am seeking suggestions to improve the performance of my CUDA implementation of the forward kinematics calculation. About the CUDA-based approaches:
- @guvectorize with target="cuda": I believe this implementation is already close to optimal, since it is written with the smallest possible input dimensions, which maximizes the number of CUDA workers. The @guvectorize decorator doesn't provide much flexibility for further improvements, I think.
- @cuda.jit: I suspect there is room for improvement here, as I am relatively new to CUDA programming. A few ideas I have considered:
  - [IMPLEMENTED ONE] Launching N blocks with 13 threads each and using conditional statements to assign each thread a specific entry of the 4x4 output array. However, I am aware that this branching causes warp divergence and may not be efficient.
  - Launching a separate kernel for each entry of the 4x4 array, which would remove all the if statements. However, this approach may introduce significant overhead due to the large number of kernel launches (a sketch of this idea is shown right below).
Therefore:
- Is this type of computation suitable for a GPU at all? I have my doubts: although it is computationally intensive, each entry of the 4x4 output array requires the execution of completely different instructions.
- About the @guvectorize implementation with target="cuda": is there any possibility to make it quicker?
- About the @cuda.jit implementation: what is the common approach someone with more CUDA experience would take for this type of computation?
Working code used (requires Numba and a CUDA-capable GPU):
from time import perf_counter
import numpy as np
from numba import bool_, cuda, float64, guvectorize
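
# The (4, 4) boolean "dummy" argument exists only so that guvectorize knows the
# output dimension m in "(n), (m,m) -> (m,m)"; its values are never read.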
@guvectorize(
    [(float64[:], bool_[:, :], float64[:, :])],
    "(n), (m,m) -> (m,m)",
    nopython=True,
    target="cuda",
)
def guvectorized(joints, dummy, res):
    q1, q2, q3, q4, q5, q6 = joints
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)
    res[0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
        s1 * c4 - c1 * c23 * s4
    )
    res[1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
        c1 * c4 - s1 * c23 * s4
    )
    res[2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    res[0, 1] = s6 * (
        c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
    )
    res[1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
        c1 * c4 + s1 * c23 * s4
    )
    res[2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    res[0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    res[1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    res[2, 2] = s23 * c4 * c5 - c23 * c5
    res[0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    res[3, 0] = 0.0
    res[3, 1] = 0.0
    res[3, 2] = 0.0
    res[3, 3] = 1.0
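
# One block per configuration, 13 threads per block: threads 0-11 each write one
# entry of the 4x4 result and thread 12 writes the constant last row.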
@cuda.jit
def cuda_jit(joints, res):
    i = cuda.threadIdx.x
    block_id = cuda.blockIdx.x
    q1, q2, q3, q4, q5, q6 = joints[block_id]
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)
    if i == 0:
        res[block_id, 0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
            s1 * c4 - c1 * c23 * s4
        )
    elif i == 1:
        res[block_id, 1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
            c1 * c4 - s1 * c23 * s4
        )
    elif i == 2:
        res[block_id, 2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    elif i == 3:
        res[block_id, 0, 1] = s6 * (
            c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
        )
    elif i == 4:
        res[block_id, 1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
            c1 * c4 + s1 * c23 * s4
        )
    elif i == 5:
        res[block_id, 2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    elif i == 6:
        res[block_id, 0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    elif i == 7:
        res[block_id, 1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    elif i == 8:
        res[block_id, 2, 2] = s23 * c4 * c5 - c23 * c5
    elif i == 9:
        res[block_id, 0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 10:
        res[block_id, 1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 11:
        res[block_id, 2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    elif i == 12:
        res[block_id, 3, 0] = float64(0.0)
        res[block_id, 3, 1] = float64(0.0)
        res[block_id, 3, 2] = float64(0.0)
        res[block_id, 3, 3] = float64(1.0)
if __name__ == "__main__":
    # Setup
    N = 1_000_000
    # Guvectorized arrays
    input_arr = np.random.rand(N, 6)
    dummy_arr = np.empty((4, 4), dtype=np.bool_)
    out = guvectorized(input_arr, dummy_arr)
    # Cuda.jit arrays
    input_arr_cu = cuda.to_device(input_arr)
    output_arr_cu = cuda.device_array((N, 4, 4))
    threads_per_block = 13
    blocks_per_grid = N
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)
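    # The calls above also trigger JIT compilation, so the timed runs below
    # exclude compilation overhead.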
    # Timing
    # Guvectorize
    init = perf_counter()
    guvectorized(input_arr, dummy_arr)
    print(f"Time `guvectorize`: {perf_counter() - init}")
    # Cuda jit
    init = perf_counter()
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)
    cuda.synchronize()
    print(f"Time `cuda.jit`: {perf_counter() - init}")