7

I have some board numpy arrays like that:

array([[0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0]])

And I'm using the following code to find the sum of elements on each nth diagonal from -7 to 8 of the board (and the mirrored version of it).

n = 8
rate = [b.diagonal(i).sum()
        for b in (board, board[::-1])
        for i in range(-n+1, n)]

After some profiling, this operation is taking about 2/3 of overall running time and it seems to be because of 2 factors:

  • The .diagonal method builds a new array instead of a view (looks like numpy 1.7 will have a new .diag method to solve that)
  • The iteration is done in python inside the list comprehension

So, there are any methods to find these sums faster (possibly in the C layer of numpy)?


After some more tests, I could reduce 7.5x the total time by caching this operation... Maybe I was looking for the wrong bottleneck?


One more thing:

Just found the .trace method that replaces the diagonal(i).sum() thing and... There wasn't much improvement in performance (about 2 to 4%).

So the problem should be the comprehension. Any ideas?

JBernardo
  • 32,262
  • 10
  • 90
  • 115
  • Caching is the right way of your problem. But for the real bottlenect, I think is the python language. If you really want better performance for this operation, you need C. – Mayli May 29 '12 at 06:00
  • @Mayli Caching solved part of the problem. The profiling still says this is the most expensive calculation... – JBernardo May 29 '12 at 06:14
  • Rewrite the hotspot in C will always bring some performance, isn't? – Mayli May 29 '12 at 06:30
  • @Mayli Only if nothing else works. I'm still trying – JBernardo May 29 '12 at 06:40
  • Calculation needs time. So you should try avoid them. – Kabie May 29 '12 at 07:44
  • @Mayli - you are wrong! If JBernardo is using numpy, then going out from python is bad idea. numpy is well optimized and has really great environment. Thorough scipy you can box the code into waves if performance concerns. Or use PyPy exeprimental numpy support which is quiet good enough. – Robert Zaremba May 29 '12 at 09:08
  • Don't forget that numpy arrays are stored in row order, so the diagonals are not located adjacently in memory which results in a lot of cache misses if the array won't fit in the cache. This makes summing the diagonal quickly difficult. If you need to take the trace multiple times of your matrix, I might recommend extracting the diagonal and saving it separately to prevent cache misses while summing. – SethMMorton May 30 '12 at 00:47

2 Answers2

7

There's a possible solution using stride_tricks. This is based in part on the plethora of information available in the answers to this question, but the problem is just different enough, I think, not to count as a duplicate. Here's the basic idea, applied to a square matrix -- see below for a function implementing the more general solution.

>>> cols = 8
>>> a = numpy.arange(cols * cols).reshape((cols, cols))
>>> fill = numpy.zeros((cols - 1) * cols, dtype='i8').reshape((cols - 1, cols))
>>> stacked = numpy.vstack((a, fill, a))
>>> major_stride, minor_stride = stacked.strides
>>> strides = major_stride, minor_stride * (cols + 1)
>>> shape = (cols * 2 - 1, cols)
>>> numpy.lib.stride_tricks.as_strided(stacked, shape, strides)
array([[ 0,  9, 18, 27, 36, 45, 54, 63],
       [ 8, 17, 26, 35, 44, 53, 62,  0],
       [16, 25, 34, 43, 52, 61,  0,  0],
       [24, 33, 42, 51, 60,  0,  0,  0],
       [32, 41, 50, 59,  0,  0,  0,  0],
       [40, 49, 58,  0,  0,  0,  0,  0],
       [48, 57,  0,  0,  0,  0,  0,  0],
       [56,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  7],
       [ 0,  0,  0,  0,  0,  0,  6, 15],
       [ 0,  0,  0,  0,  0,  5, 14, 23],
       [ 0,  0,  0,  0,  4, 13, 22, 31],
       [ 0,  0,  0,  3, 12, 21, 30, 39],
       [ 0,  0,  2, 11, 20, 29, 38, 47],
       [ 0,  1, 10, 19, 28, 37, 46, 55]])
>>> diags = numpy.lib.stride_tricks.as_strided(stacked, shape, strides)
>>> diags.sum(axis=1)
array([252, 245, 231, 210, 182, 147, 105,  56,   7,  21,  42,  70, 105,
       147, 196])

Of course, I have no idea how fast this will actually be. But I bet it will be faster than a Python list comprehension.

For convenience, here's a fully general diagonals function. It assumes you want to move the diagonal along the longest axis:

def diagonals(a):
    rows, cols = a.shape
    if cols > rows:
        a = a.T
        rows, cols = a.shape
    fill = numpy.zeros(((cols - 1), cols), dtype=a.dtype)
    stacked = numpy.vstack((a, fill, a))
    major_stride, minor_stride = stacked.strides
    strides = major_stride, minor_stride * (cols + 1)
    shape = (rows + cols - 1, cols)
    return numpy.lib.stride_tricks.as_strided(stacked, shape, strides)
Community
  • 1
  • 1
senderle
  • 145,869
  • 36
  • 209
  • 233
  • 1
    This is faster than the `.trace` method! – JBernardo May 30 '12 at 01:10
  • Actually, I can discard the diagonals -7 and 7 from each board because they do not affect the result. But even with them, this method (and a dot product `dot(ones(8), diagonals(board).T)`), I can make the `sum` 10 to 15% faster. – JBernardo May 30 '12 at 01:11
  • I edited `diagonals` to make it fully general; it should work correctly on all 2-d arrays. – senderle May 30 '12 at 12:30
  • 1
    Thanks. I only have square matrices of the same size. So I also use the same `fill` array instead of creating one at each iteration. This function is called about 100K times in 2 or 3 seconds. – JBernardo May 30 '12 at 18:16
  • 1
    BTW, the `reshape` operation could be replaced by: `numpy.zeros(((cols - 1), cols), ...)` – JBernardo May 30 '12 at 18:16
  • Great answer, thanks! I would like to add something. There are a couple issues here that have to do with the fact you started with a square matrix. first of all, it should be shape = (rows * 2 - 1, cols). Second of all, if your cols > rows, then the last diagonal will go outside the "stacked" array. So you have to make another filler that has max(0, cols-rows-1) rows and add it at the end: stacked = numpy.vstack((a, fill, a, fill2)). Finally it's dtype='int8' not 'i8'. Maybe it was 'i8' back then. Mods: could you add this to the answer please? – cheater Oct 14 '16 at 04:51
  • @cheater, yeah, I don't see what you're getting at. Are you talking about the code for the function, or the code entered at the interpreter? The function is fully generalized to handle non-square arrays. I'll edit the question to make things clearer. – senderle Oct 14 '16 at 13:52
  • @cheater, also, I don't see any evidence that `i8` is wrong in any way. Where are you getting that? – senderle Oct 14 '16 at 14:01
  • i went by the code you entered into the interpreter. That's where the errors were. – cheater Oct 18 '16 at 01:46
  • You're right, i8 is a proper argument for dtype, on my computer it's an alias for int64. – cheater Oct 18 '16 at 01:50
2

As I posted in a comment, I wouldn't go into C code.

Try to go with PyPy. Actually it's numpy support is quiet good (however it not support directly array.diagonal) - I didn't check if there is other buidin method for that. Nerveless, I tried the following code:

try:
    import numpypy  # required by PyPy
except ImportError:
    pass
import numpy

board = numpy.array([[0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0]])

n=len(board)
def diag_sum(i, b):
    s = 0
    if i>=0:
        row = 0
        end = n
    else:
        row = -i
        end = n+i
        i = 0
    while i<end:
        s += b[row, i]
        i+=1
        row+=1
    return s


import time
t=time.time()
for i in xrange(50000):
    # rate = [b.diagonal(i).sum()
    #         for b in (board, board[::-1])
    #         for i in range(-n+1, n)]

    rate = [diag_sum(i,b)
            for b in (board, board[::-1])
            for i in range(-n+1, n)]

print time.time() - t

The results are:

  • 0.64s PyPy with diag_sum
  • 6.01s CPython version with diag_sum
  • 5.60s CPython version with b.diagonal
Robert Zaremba
  • 8,081
  • 7
  • 47
  • 78
  • Did you try the version from trunk? With PyPy you are calling normal numpy, you normally shouldn't use numpypy more then import it before numpy. But stay in touch since numpy support is under active development in PyPy – Robert Zaremba May 29 '12 at 19:52