Replace loop with broadcasting in numpy -> memory error

Question

I have an 2D-array (array1), which has an arbitrary number of rows and in the first column I have strictly monotonic increasing numbers (but not linearly), which represent a position in my system, while the second one gives me a value, which represents the state of my system for and around the position in the first column.

Now I have a second array (array2); its range should usually be the same as for the first column of the first array, but does not matter to much, as you will see below. I am now interested for every element in array2: 1. What is the argument in array1[:,0], which has the closest value to the current element in array2? 2. What is the value (array1[:,1]) of those elements.

As usually array2 will be longer than the number of rows in array1 it is perfectly fine, if I get one argument from array1 more than one time. In fact this is what I expect. The value from 2. is written in the second and third column, as you will see below.

My striped code looks like this:

from numpy import arange, zeros, absolute, argmin, mod, newaxis, ones


ysize1 = 50
array1 = zeros((ysize1+1,2))


array1[:,0]   = arange(ysize1+1)**2      
    # can be any strictly monotonic increasing array 
array1[:,1]   = mod(arange(ysize1+1),2)  
    # in my current case, but could also be something else



ysize2      = (ysize1)**2
array2      = zeros((ysize2+1,3))
array2[:,0]   = arange(0,ysize2+1)                  
# is currently uniformly distributed over the whole range, but does not necessarily have to be


a = 0
for i, array2element in enumerate(array2[:,0]):
    a = argmin(absolute(array1[:,0]-array2element))
    array2[i,1] = array1[a,1]

It works, but takes quite a lot time to process large arrays. I then tried to implement broadcasting, which seems to work with the following code:

indexarray = argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)
array2[:,2]=array1[indexarray,1]    # just to compare the results

Unfortunately now I seem to run into a different problem: I get a memory error on the sizes of arrays I am using in the line of code with the broadcasting. For small sizes it works, but for larger ones where len(array2[:,0]) is something like 2**17 (and could be even larger) and len(array1[:,0]) is about 2**14. I get, that the size of the array is bigger than the available memory. Is there an elegant way around that or to speed up the loop? I do not need to store the intermediate array(s), I am just interested in the result.

Thanks!

score 0 · Accepted Answer · edited May 23 '17 at 10:26

First lets simplify this line:

argmin(absolute(ones(array2[:,0].shape[0])[:,newaxis]*array1[:,0]-array2[:,0][:,newaxis]),1)

it should be:

a = array1[:, 0]
b = array2[:, 0]
argmin(abs(a - b[:, newaxis]), 1)

But even when simplified, you're creating two large temporary arrays. If a and b have sizes M and N, b - a and abs(...) each create a temporary array of size (M, N). Because you've said that a is monotonically increasing, you can avoid the issue all together by using a binary search (sorted search) which is much faster anyways. Take a look at the answer I wrote to this question a while back. Using the function from this answer, try this:

closest = find_closest(array1[:, 0], array2[:, 0])
array2[:, 2] = array1[closest, 1]

Wow, thats an amazing speedup! :) Great solution and great explanation of how it works! Thank you very much! — freeone, Jul 11 '14 at 21:00

Replace loop with broadcasting in numpy -> memory error

1 Answers1