
I have 2 arrays of unequal size:

>>> np.size(array1)
4004001
>>> np.size(array2)
1000 

Now, each element in array2 needs to be compared to all the elements in array1, to find the element whose value is nearest to that of the element in array2. Each nearest value found then needs to be stored in a different array of size 1000 - one of a size corresponding to array2.

The tedious and crude way of doing it would be a for loop: take each element from array2, compute the absolute difference against every element of array1, and then take the minimum - roughly like the sketch below. This is going to make my code really slow.
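A minimal sketch of that crude approach (illustrative names, random data standing in for my real arrays):

import numpy as np

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

# Crude O(len(array1) * len(array2)) loop: for every element of array2,
# scan all of array1 for the value with the smallest absolute difference.
output = np.empty_like(array2)
for i, v in enumerate(array2):
    output[i] = array1[np.abs(array1 - v).argmin()]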

I'd like to use numpy vectorized operations to do this, but I've kind of hit a wall.

sb25
    Sort both arrays first. Then step through large array, keeping an index to the currently closest element in the small array. Increment the index as required. I wouldn't be terribly surprised if there was something in itertools that would speed this up. – Cody Piersall Apr 22 '17 at 04:27
    Possible duplicate of [find nearest value in numpy array](http://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array) – Cody Piersall Apr 22 '17 at 04:54
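A sketch of the sort-and-sweep suggestion in the first comment above (my own illustrative code, not Cody Piersall's; it visits the small array in sorted order while advancing a pointer into the sorted large array, so after sorting the sweep is linear in the array sizes):

import numpy as np

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

a = np.sort(array1)
order = np.argsort(array2)  # visit array2 in ascending order
nearest = np.empty_like(array2)

j = 0
for i in order:
    v = array2[i]
    # Advance the pointer while the next element of a is at least as close.
    while j + 1 < len(a) and abs(a[j + 1] - v) <= abs(a[j] - v):
        j += 1
    nearest[i] = a[j]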

3 Answers


To make full use of the numpy parallelism we need vectorized functions. Furthermore, all values are found in the same array (array1) using the same criterion (nearest). Therefore, it is possible to make a special function for searching in array1 specifically.

However, to make the solution more reusable it is better to start from a more general solution and then transform it into a more specific one. Thus, as a general approach to finding the closest value, we start with this find nearest solution. Then we turn it into a more specific function and vectorize it, to allow it to work on multiple elements at once:

import math
import numpy as np
from functools import partial

def find_nearest_sorted(array, value):
    # Find the insertion point of value in the sorted array.
    idx = np.searchsorted(array, value, side="left")
    # The nearest value is either the element just before or just after the
    # insertion point; return whichever is closer.
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx - 1]) < math.fabs(value - array[idx])):
        return array[idx - 1]
    else:
        return array[idx]

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

array1_sorted = np.sort(array1)

# Partially apply array1_sorted to the general function, turning it into a
# specific one that works with array1 only.
find_nearest_in_array1 = partial(find_nearest_sorted, array1_sorted)

# Vectorize the specific function, to allow us to apply it to all elements
# of array2 the numpy way.
vectorized_find = np.vectorize(find_nearest_in_array1)

output = vectorized_find(array2)

Hopefully this is what you wanted: a new vector mapping the data in array2 to the nearest values in array1.
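As an aside, np.vectorize is essentially a convenience loop at the Python level rather than true vectorization, so for large inputs a fully vectorized variant of the same searchsorted idea may be noticeably faster. A rough sketch (the helper name find_nearest_sorted_vec is introduced here, reusing array1_sorted and array2 from above):

def find_nearest_sorted_vec(array, values):
    # searchsorted accepts an array of values, so one call handles them all.
    idx = np.searchsorted(array, values, side="left")
    # Clip so that both idx - 1 and idx are valid indices.
    idx = np.clip(idx, 1, len(array) - 1)
    left = array[idx - 1]
    right = array[idx]
    # Pick whichever neighbour is closer, elementwise.
    return np.where(np.abs(values - left) < np.abs(values - right), left, right)

output = find_nearest_sorted_vec(array1_sorted, array2)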

JohanL
  • And, since we are to look into `array1` multiple times (1000) it is beneficial to first sort the array, taking a one time cost of sorting, to speed up each find operation that follows. – JohanL Apr 22 '17 at 07:13
  • Thank you @JohanL and everyone for helping! I've never used functools before. This is great! – sb25 Apr 22 '17 at 19:36
import numpy as np

a = np.random.random(size=4004001).astype(np.float16)
b = np.random.random(size=1000).astype(np.float16)

# Use numpy broadcasting to compute the pairwise differences, then find the
# argmin in a for each element of b. Finally, extract the elements from a
# using the argmin array as indexes.
output = a[np.argmin(np.abs(b[:, None] - a), axis=1)]

This solution, while simple, can be very memory intensive. It may need a bit of further optimisation if used on large arrays.
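To put a rough number on that (my estimate, not part of the original answer): the intermediate matrix b[:, None] - a has 1000 x 4004001 elements, so even at 2 bytes per float16 it needs about 1000 * 4004001 * 2 ≈ 8 GB, and float64 inputs would need four times as much.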

Allen Qin
  • The time and space complexity of this solution is rather large, since it expands the problem to a matrix with the dimensions 4004001x1000, and it does not sort `array1`, making the find (`min`) operation slower than it needs to be. – JohanL Apr 22 '17 at 07:35
  • Yeah, I realized that and I'm thinking about ways to optimize it while maintaining its simplicity. – Allen Qin Apr 22 '17 at 07:36
  • Please also edit your answer to include some explanation. Code-only answers do very little to educate future SO readers. Your answer is in the moderation queue for being low-quality. – mickmackusa Apr 22 '17 at 08:04

The most "numpythonic" way is to use broadcasting. This is a quick and easy way to calculate a distance matrix, for which you can then take the argmin of the absolute value.

array1 = np.random.rand(4004001)
array2 = np.random.rand(1000)

# Calculate distance matrix (on truncated array1 for memory reasons)
dmat = array1[:400400] - array2[:,None]

# Take the abs of the distance matrix and work out the argmin along the last axis
ix = np.abs(dmat).argmin(axis=1)

shape of dmat:

(1000, 400400)

shape and contents of ix:

(1000,)
array([237473, 166831,  72369,  11663,  22998,  85179, 231702, 322752, ...])

However, it's memory hungry if you do this operation in one go: the full 1000 x 4004001 matrix of float64 values would need roughly 32 GB, so it actually doesn't work on my 8GB machine for the size of arrays that you specify, which is why I reduced the size of array1.

To make it work within memory constraints, simply slice one of the arrays into chunks and apply broadcasting on each chunk in turn (or parallelise). In this case, I've sliced array2 into 10 chunks:

# Define number of chunks and calculate chunk size
n_chunks = 10
chunk_len = array2.size // n_chunks

# Preallocate output array (it will hold argmin indices into array1)
out = np.zeros(1000, dtype=int)

for i in range(n_chunks):
    s = slice(i*chunk_len, (i+1)*chunk_len)
    out[s] = np.abs(array1 - array2[s, None]).argmin(axis=1)
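Since out holds indices into array1 (like ix above) rather than the values themselves, one more indexing step recovers the nearest values; a small usage sketch (nearest_values is a name introduced here):

# Map the argmin indices back to the actual nearest values in array1.
nearest_values = array1[out.astype(int)]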
FuzzyDuck
  • Your solution is still rather memory hungry, even with the chunking. It is also quite slow, as the min operation is O(n) for unsorted lists. That is why I felt the need for a somewhat more complex approach, but with much improved time complexity. – JohanL Apr 22 '17 at 08:54
  • But it works and it's easy to understand. If speed and memory are an important issue for the OP that can't be resolved by parallelisation, then a more complex approach is justified. – FuzzyDuck Apr 22 '17 at 09:21