Fastest way to compute many 3x3 matrix-matrix multiplications

Question

I need to compute the combination of many 3x3 rotation matrices.

Here is a comparison of applying functools.reduce on matmul with numpy and cupy:

import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation

# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]

# then reduce with matmul

xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")

On a good machine with a Titan GPU, this gives :

numpy: 1.63e+02ms
cupy: 8.78e+02ms

For some reason the GPU is much slower.

In any case, is there a way to calculate this significantly faster ?

Edit

I found a rather simple solution, that works for all chains of small linear transformations (and can be extended to affine transformations easily).


def reduce_loop(matrices):
    """ non-optimized reduce """
    mat = matrices[0]
    for _mat in matrices[1:]:
        mat = mat @ _mat
    return mat

def reduce_split(matrices): 
    """ reduce by multiplying pairs of matrices recursively """
    if len(matrices) == 1:
        return matrices[0]
    neven = (len(matrices) // 2) * 2
    reduced = matrices[:neven:2] @ matrices[1:neven:2]
    if len(matrices) > neven:  # len(matrices) is odd
        reduced[-1] = reduced[-1] @ matrices[-1]
    return reduce_split(reduced)

time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")

time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")

Giving:

reduce_loop: 2.14e+02ms
reduce_split: 24.5ms

I'm sure it's not optimal, but it uses numpy's (and probably cupy's) optimization.

Check out [this question](https://stackoverflow.com/questions/8919086/why-are-quaternions-used-for-rotations#:~:text=For%20quaternions%20versus%20a%203x3%20rotation%20matrix%2C%20the,these%20representations%20of%20rotations%20are%20used%20in%20practice.) on quaternion. — Quang Hoang, Oct 28 '20 at 17:45
@QuangHoang thanks, that was interesting, even though I did not feel like reimplementing everything using quaternions. However, it seems there is still a debate on whether quaternions multiplications are really faster that matrix multiplications. — piliv, Oct 30 '20 at 18:19

Stripedbass · Answer 1 · 2020-10-29T14:44:05.593

1

functools.reduce() was removed from core python because it is inefficient and not pythonic. There is no cuPy equivalent, only the host version in the functools library
your cuPy code is spending most of its time fruitlessly copying data from host to device and back again... thousands of times - because reduce() runs only on the host not on the GPU. You are straining your PCI bus, not the GPU
consider making the list “rotations” into a cuPy matrix, and then use striding (not a python list)
use a cuPy reduction kernel to do the matmul https://docs.cupy.dev/en/stable/reference/generated/cupy.ReductionKernel.html

edited Oct 29 '20 at 14:44

answered Oct 29 '20 at 00:07

Stripedbass

194
1
8

Thanks. Regarding 1 and 2: exactly when the GPU transfers data to CPU is still unclear to me. The pseudo reduce functions I implemented (cf edit) does not fare better on the GPU, and should not transfer the data back in each loop. However, it is true that the GPU is really not useful for running 3x3 matrix multiplications. 4. Could not find a way to use these functions (neither `cupy.fuse`, which seems to be simpler). It's still lacking some documention. – piliv Oct 30 '20 at 18:27

Fastest way to compute many 3x3 matrix-matrix multiplications

Edit

1 Answers1