Can this NumPy operation perform as fast, or faster, than its Cython Equivalent?

Question

TLDR; I'm performing an array operation (no mathematics) and I've found Cython to be significantly faster. Is there a way I can speed this up in NumPy; or Cython?

Context

I'm writing a function that is meant to take a subset of an NxN array from index onward in both directions (whose top corner is along the diagonal) and shift it one place upwards along the diagonal. Secondly, I need to shift the top row from index onward one place to the left. Lastly, I need to set the last column in the array to zero after the operation.

The array is a strictly upper triangular matrix meaning that everything from the diagonal downwards is set to 0. This is my attempt at an elegant way to store historical collision data between pairs of objects (whose indices are represented by indices in the matrix). This would be similar to making a nested list of size n!/(2(n-2)!) which represents the ordered pairs of a list of indices of length n. In this algorithm, I hope to "remove" an object from the collision pairing matrix.

The advantage I find in this implementation is that "removing a collision pair" from the matrix is much less computationally intensive than removing pairs from a nested list and shifting the indices in pairs past the "index to remove" point.

The overall project centers around the automated "packing" of 3D models into a build volume for powder bed fusion additive manufacturing. The algorithm uses simulated annealing, so the ability to prune a collision set, store historical information, add/remove geometry are of upmost importance and need to be well optimized.

Example

Lets say our array takes this form (not representative of actual data).

arr = 
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
 [0. 0. 2. 3. 4. 5. 6. 7. 8. 9.]
 [0. 0. 0. 3. 4. 5. 6. 7. 8. 9.]
 [0. 0. 0. 0. 4. 5. 6. 7. 8. 9.]
 [0. 0. 0. 0. 0. 5. 6. 7. 8. 9.]
 [0. 0. 0. 0. 0. 0. 6. 7. 8. 9.]
 [0. 0. 0. 0. 0. 0. 0. 7. 8. 9.]
 [0. 0. 0. 0. 0. 0. 0. 0. 8. 9.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 9.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Then using index = 3 we should take everything in the subset index+1:n, index+1:n and set it equal to index:n-1, index:n-1. Then shift the top row to the left; again from index onward. Then set the last column to 0.

fun(3, arr)
[[0. 1. 2. 4. 5. 6. 7. 8. 9. 0.]
 [0. 0. 2. 3. 4. 5. 6. 7. 8. 0.]
 [0. 0. 0. 3. 4. 5. 6. 7. 8. 0.]
 [0. 0. 0. 0. 5. 6. 7. 8. 9. 0.]
 [0. 0. 0. 0. 0. 6. 7. 8. 9. 0.]
 [0. 0. 0. 0. 0. 0. 7. 8. 9. 0.]
 [0. 0. 0. 0. 0. 0. 0. 8. 9. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 9. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Implementation 1: Pure NumPy

Again assuming arr is an NxN matrix.

def fun(index, n, arr):
     arr[index:-1, index:-1] = arr[index + 1:, index + 1:]
     arr[0, index:-1] = arr[0, index + 1:]
     arr[:, n-1:] = 0
     return arr

Implementation 2: Cython

Bear with me as this is my very first time implementing Cython.

@cython.boundscheck(False)
def remove_from_collision_array(int index, int n, double[:,:] arr):

    cdef int i, j, x_shape, y_shape
    x_shape = arr.shape[0]

    for i in range(index, x_shape):
        for j in range(index, x_shape):
            if j <= i:
                # We are below the diagonal, do nothing
                continue
            elif i >= n-1 or j >= n-1:
                arr[i, j] = 0
            else:
                arr[i, j] = arr[i+1, j+1]

    arr[0, index:-1] = arr[0, index+1:]
    arr[:, n-1:] = 0

    return np.asarray(arr)

Discussion

Before anybody gets upset, yes I don't know what I'm doing in Cython. I disabled bounds_checking because it really really speeds things up. And I'm performing a bounds check in the loop with one of my elif statements.

I initially thought there would be no way that performing this operation in a loop would be faster than NumPy. I pre-allocate a NumPy array of size 5000x5000 to avoid needing to append, etc on the fly. I even tested the Cython implementation using the same 3 lines as the Numpy one, but it also performs poorly.

You can see that using index=0 will require the most computation. So I use that as a benchmark. While testing this in a loop, I've found that the Cython implementation is about 50%+ faster than the Numpy version. Perhaps this is because I am not adequately using the tools NumPy has to offer?

I am by no means a computer scientist, nor do I know if this is the best route. I'm a designer prototyping a system. If anybody has any insight on how to make this scream even faster, please let me know!

Update on the answer

Thanks to Jerome for teaching me something today! This will be instrumental in making this package run at lightning speed. I've added his insights to my code, resulting in a massive performance boost for two reasons that I can see:

I've cut the number of loop iterations by n*(n-1)/2 by starting the j-loop above the diagonal.
I've removed all conditional statements.

Here is the updated Cython:

@cython.boundscheck(False)
@cython.wraparound(False)
def remove_from_collision_arrayV2(int index, int n, double[:,:] arr):

    cdef int i, j

    # Shift the diagonal matrix
    for i in range(index, n-1):
        for j in range(i, n-1):
                arr[i, j] = arr[i+1, j+1]

    # Shift the rop row
    for j in range(index, n-1):
        arr[0, j] = arr[0, j+1]

    # Set Column column n-1 to zero
    for i in range(n):
        arr[i, n-1] = 0

    return np.asarray(arr)

For benchmarking purposes. Performing this iteration 500 times using index=0 on a 500x500 matrix:

Original NumPy Code: 52.8s

Original Cython Code: 16.47s - 3.2x Speedup

Updated Cython Code: 0.014s - 3550x Speedup

A big point of Cython is to make indexing in loops really efficient, so it isn't hugely surprising that it beats Numpy. You could possibly also gain a bit by turning `wraparound` off (if you never use negative indices) — DavidW, Nov 25 '21 at 07:07
The assumption, that numpy has/could have optimal performance is wrong. There are a lot of tools that improve numpy's performance - like `numexpr` or `numba`. Also code produced by Cython could be faster than numpy. See for example https://stackoverflow.com/a/54312134/5769463 — ead, Nov 25 '21 at 10:17
@DavidW thank you for the insight, I will disable `wraparound` as well. — ddm-j, Nov 25 '21 at 17:14
@ead I'm not that surprised Cython is faster in this instance. I was more wondering if there was a way to do this closer to Cython speed using more of NumPy's array operation methods. — ddm-j, Nov 25 '21 at 17:15

score 3 · Accepted Answer · answered Nov 25 '21 at 11:30

The reason why the expression arr[index:-1, index:-1] = arr[index + 1:, index + 1:] is slow in both Numpy and Cython and the Cython code is much faster is a bit counter intuitive: this expression is not efficiently implemented in both Numpy and Cython.

Indeed, Numpy copy the right-hand side (arr[index + 1:, index + 1:]) in a temporary array allocated on-the-fly. The temporary array is then copied to the left-hand side (arr[index:-1, index:-1]). This means that two memory copy are performed while only one could be used. It is even worse: the copied memory is pretty big and will not fit in the cache resulting in a bigger overhead (on some processors, like the mainstream x86/x86-64 ones, the write-back policy cause additional slow reads). Moreover, the new temporary array will cause many page fault slowing down even more the copy.

Numpy do this because the left-hand side and the right-hand side may overlap (which is the case here) and thus the order in which the memory bytes are copied matter a lot. Numpy use a slow conservative approach rather than an optimized implementation. This is a missed optimization. Cython does exactly the same thing.

Your Cython code do not suffer from all these overheads: it directly copy the array in-place relatively efficiently. The value read are kept in the cache and then written just after so that the write-back policy is not an issue. Moreover, there are no temporary array nor page faults. Finally, your Cython code do not copy the lower-part of the triangular matrix resulting in fewer bytes to be copied compared to the expression previously mentioned.

One way to reduce the overhead of the Numpy expression is to copy the matrix chunk-by-chunk and allocates a small temporary buffer for that (typically few lines of the matrix). However, this is far from being easy since CPython loops are generally very slow and the chunk size should fit in cache so the method can be useful...

Further optimization: conditionals are slow. You can remove them by starting the j-based loop at i+1 and ending it at n-1. Another j-based loop can then fill the value greater than n-1. For the same reason, the i-based loop should end at n-1 and another loop can then fill the remaining part of the array. A good compiler should use faster SIMD instructions.

Thanks a lot for your answer! 3550x improvement over my NumPy implementation. — ddm-j, Nov 25 '21 at 17:15

Can this NumPy operation perform as fast, or faster, than its Cython Equivalent?

1 Answers1

Linked