
Let's say I have two large 2-D NumPy arrays of the same dimensions (say 2000x2000). I want to sum them element-wise. I was wondering if there is a faster way than np.add().

Edit: I am adding an example similar to what I am using now. Is there a way to speed this up?

# a and b are the two matrices I already have. Dimensions are 2000x2000.
# shift is a list of offsets that is known in advance.
for j in range(100000):
    b = np.roll(b, shift[j], axis=0)
    a = np.add(a, b)
– Anirban Dutta
  • Adding 2000x2000 arrays is only 4 million elements, and adding is pretty straightforward from the hardware side. I don't think you get significant speed up by optimizing this operation. Are you concerned about performance of your code? Is this in the middle of a critical loop? – Jonathan Wheeler Dec 18 '16 at 05:32
  • @JonathanWheeler Actually you are correct. It is inside a loop and the addition is done about 200 times a sec. This addition alone consumes more than 50% of total running time. Any suggestions? – Anirban Dutta Dec 18 '16 at 05:36
  • One possible speedup is to add the numbers as they are calculated. Have you considered using the `Threading` or `Subprocess` library? (Those might not be the names) – Jonathan Wheeler Dec 18 '16 at 05:41
  • @AnirbanDutta, You may be able to shave off some time by reducing copies of the arrays. For example, x += y is faster than z = x + y (see the timing sketch after these comments). –  Dec 18 '16 at 05:43
  • There isn't much optimization that can go into the addition operation. Given more context of the program we can offer alternatives (like threading), but as the question stands I really don't think there is very much to be said. – Tadhg McDonald-Jensen Dec 18 '16 at 05:45
  • As usual, if adding the arrays is your bottleneck, then you won't have time to process the resulting arrays. Are you sure you need to add all elements ? –  Dec 18 '16 at 13:54
  • I can attest that @user4322543's suggestion to reduce copies can significantly speed up the operation for very large arrays! – dwagnerkc Jan 31 '20 at 19:21
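
Following up on the in-place addition suggestion in the comments: here is a minimal, hypothetical timing sketch (the sizes and the use of timeit are illustrative, not from the thread). np.add(x, y, out=x) is the functional spelling of x += y:

import timeit

import numpy as np

x = np.random.rand(2000, 2000)
y = np.random.rand(2000, 2000)

# z = x + y allocates a fresh 2000x2000 result array on every call
t_copy = timeit.timeit(lambda: x + y, number=100)

# np.add(x, y, out=x) (i.e. x += y) reuses x's buffer -- no new
# allocation, but note that it mutates x on each call
t_inplace = timeit.timeit(lambda: np.add(x, y, out=x), number=100)

print(f"out-of-place: {t_copy:.3f}s, in-place: {t_inplace:.3f}s")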

1 Answer


Approach #1 (Vectorized)

We can use modulus to simulate the circulating behavior of roll/circshift, and with broadcasted indices to cover all rows we get a fully vectorized approach, like so -

n = b.shape[0]
# shift needs to be a NumPy array here (for .cumsum()), not a Python list
idx = n-1 - np.mod(shift.cumsum()[:,None]-1 - np.arange(n), n)
a += b[idx].sum(0)
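
Note that b[idx] here materializes an (M, n, n) temporary, which for large M can cost more than the additions themselves. If memory becomes the bottleneck, a chunked variant is worth trying; a minimal sketch (the function name and chunk size are my own, not part of the original approach) -

import numpy as np

def vectorized_chunked(a, b, shift, chunk=256):
    n = b.shape[0]
    idx = n - 1 - np.mod(shift.cumsum()[:, None] - 1 - np.arange(n), n)
    # Sum in chunks so the temporary b[idx[i:i+chunk]] stays at
    # (chunk, n, n) instead of (M, n, n)
    for i in range(0, idx.shape[0], chunk):
        a += b[idx[i:i + chunk]].sum(0)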

Approach #2 (Loopy one)

n = b.shape[0]
b_ext = np.row_stack((b, b[:-1]))
start_idx = n-1 - np.mod(shift.cumsum()-1, n)
for j in range(start_idx.size):
    a += b_ext[start_idx[j]:start_idx[j]+n]

Colon notation vs using indices for slicing

The idea here is to do minimal work once we are inside the loop. We pre-compute the start row index of each iteration before entering the loop, so all we need to do inside the loop is slice with colon notation, which gives a view into the array, and add it up. This should be much better than rolling, which needs to compute all of those row indices and produces a copy, which is expensive.

Here's a bit more on the view and copy concepts when slicing with colons versus indices -

In [11]: a = np.random.randint(0,9,(10))

In [12]: a
Out[12]: array([8, 0, 1, 7, 5, 0, 6, 1, 7, 0])

In [13]: a[3:8]
Out[13]: array([7, 5, 0, 6, 1])

In [14]: a[[3,4,5,6,7]]
Out[14]: array([7, 5, 0, 6, 1])

In [15]: np.may_share_memory(a, a[3:8])
Out[15]: True

In [16]: np.may_share_memory(a, a[[3,4,5,6,7]])
Out[16]: False
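
The practical consequence: writing through a colon slice mutates the original array, while writing to a fancy-indexed result does not, since it is an independent copy. A quick self-contained illustration -

import numpy as np

a = np.arange(10)

v = a[3:8]          # view: shares memory with a
v += 100            # writes through to a
print(a)            # [  0   1   2 103 104 105 106 107   8   9]

c = a[[3, 4, 5]]    # fancy indexing: independent copy
c += 100            # a is untouched
print(a[3:6])       # [103 104 105]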

Runtime test

Function definitions -

def original_loopy_app(a,b):
    for j in range(shift.size):
        b = np.roll(b, shift[j], axis=0)
        a += b

def vectorized_app(a,b):
    n = b.shape[0]
    idx = n-1 - np.mod(shift.cumsum()[:,None]-1 - np.arange(n), n)
    a += b[idx].sum(0)

def modified_loopy_app(a,b):
    n = b.shape[0]
    b_ext = np.row_stack((b, b[:-1]))
    start_idx = n-1 - np.mod(shift.cumsum()-1, n)
    for j in range(start_idx.size):
        a += b_ext[start_idx[j]:start_idx[j]+n]

Case #1:

In [5]: # Setup input arrays
   ...: N = 200
   ...: M = 1000
   ...: a = np.random.randint(11,99,(N,N))
   ...: b = np.random.randint(11,99,(N,N))
   ...: shift = np.random.randint(0,N,M)
   ...: # Identical copies, one set per approach
   ...: a1, b1 = a.copy(), b.copy()
   ...: a2, b2 = a.copy(), b.copy()
   ...: a3, b3 = a.copy(), b.copy()
   ...: 

In [6]: original_loopy_app(a1,b1)
   ...: vectorized_app(a2,b2)
   ...: modified_loopy_app(a3,b3)
   ...: 

In [7]: np.allclose(a1, a2) # Verify results
Out[7]: True

In [8]: np.allclose(a1, a3) # Verify results
Out[8]: True

In [9]: %timeit original_loopy_app(a1,b1)
   ...: %timeit vectorized_app(a2,b2)
   ...: %timeit modified_loopy_app(a3,b3)
   ...: 
10 loops, best of 3: 107 ms per loop
10 loops, best of 3: 137 ms per loop
10 loops, best of 3: 48.2 ms per loop

Case #2:

In [13]: # Setup input arrays (exactly 1/10th of the sizes in the question)
    ...: N = 200
    ...: M = 10000
    ...: a = np.random.randint(11,99,(N,N))
    ...: b = np.random.randint(11,99,(N,N))
    ...: shift = np.random.randint(0,N,M)
    ...: a1, b1 = a.copy(), b.copy()
    ...: a3, b3 = a.copy(), b.copy()
    ...: 

In [14]: %timeit original_loopy_app(a1,b1)
    ...: %timeit modified_loopy_app(a3,b3)
    ...: 
1 loops, best of 3: 1.11 s per loop
1 loops, best of 3: 481 ms per loop

So, we are looking at 2x+ speedup there with the modified loopy approach!
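
Since the M pre-computed start indices can only take n distinct values (here M = 10000 against n = 200), one further refinement worth trying is to group repeated indices with np.bincount and add each slice once, weighted by its count. A sketch of that idea (my own untimed variation on the modified loopy approach) -

import numpy as np

def grouped_loopy_app(a, b, shift):
    n = b.shape[0]
    b_ext = np.row_stack((b, b[:-1]))
    start_idx = n - 1 - np.mod(shift.cumsum() - 1, n)
    # Each distinct start index contributes the same n-row slice, so add
    # it once weighted by its count; the loop runs at most n times
    counts = np.bincount(start_idx, minlength=n)
    for s in np.flatnonzero(counts):
        a += counts[s] * b_ext[s:s + n]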

– Divakar