I have two numpy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements from y. The output should be a one-to-one mapping from indices of elements in x to indices of elements in y. Here's a bad way to do it that relies on sorting: it removes each element from the list once it has been paired. Without sorting this would be bad, because the pairing would depend on the order of the original input arrays.
import operator
import numpy as np

def min_i(values):
    # return (index, value) of the smallest element in values
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = np.random.randn(10) * 10
unsorted_y = np.random.randn(10) * 10

# sort both arrays
x = np.sort(unsorted_x)
y = np.sort(unsorted_y)

pairs = []
indx_to_search = list(range(len(y)))

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print("ran out of items to match...")
        break
    # among the remaining y elements, find the one closest to x_item
    possible_values = np.abs(y[indx_to_search] - x_item)
    nearest_indx, nearest_dist = min_i(possible_values)
    orig_indx = indx_to_search[nearest_indx]
    # remove it so it cannot be paired again
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))

print("paired items: ")
for k, v in pairs:
    print(x[k], " paired with ", y[v])
I'd prefer to do it without sorting the elements first, but if they are sorted, then I want to get the indices into the original, unsorted arrays unsorted_x and unsorted_y. What's the best way to do this in numpy/scipy/Python, or using pandas? Thanks.
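One way I can think of to recover the original indices (a minimal sketch, assuming the pairing loop above is left unchanged) is to sort via np.argsort and translate the paired indices back through the sort order afterwards:

import numpy as np

# np.argsort returns, for each position in the sorted array, the index
# of that element in the original unsorted array
x_order = np.argsort(unsorted_x)
y_order = np.argsort(unsorted_y)
x = unsorted_x[x_order]
y = unsorted_y[y_order]

# ... run the pairing loop above on the sorted x and y ...

# translate each pair of sorted-array indices back to indices into
# the original unsorted arrays
orig_pairs = [(x_order[k], y_order[v]) for k, v in pairs]
for k, v in orig_pairs:
    print(unsorted_x[k], " paired with ", unsorted_y[v])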
edit: to clarify, I'm not trying to find the best fit across all elements (e.g., minimizing the sum of distances), but rather the best fit for each element individually, even if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the example above, so there are usually many very good fits for each value of x in y, and I just want to find that one efficiently.
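For example, a brute-force version that expresses the behavior I'm after (quadratic, so not what I want for a large y; the helper name greedy_pairs is just mine for illustration):

import numpy as np

def greedy_pairs(x, y):
    # for each x[i] in turn, pair it with the closest y[j]
    # that has not been used yet
    available = np.ones(len(y), dtype=bool)
    pairs = []
    for i, x_item in enumerate(x):
        dists = np.abs(y - x_item)
        dists[~available] = np.inf   # mask out already-paired elements
        j = int(np.argmin(dists))
        available[j] = False
        pairs.append((i, j))
    return pairs

print(greedy_pairs(unsorted_x, unsorted_y))

This works directly on the unsorted arrays and returns indices into them, which is what I want; it's just O(len(x) * len(y)).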
Can someone show an example of scipy kd-trees for this? The docs are quite sparse.

kdtree = scipy.spatial.cKDTree([x, y])
kdtree.query([-3]*10)  # ?? unsure about what query takes as arg
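From my reading of the scipy.spatial.cKDTree docs, the constructor wants an (n, d) array of n points in d dimensions, so 1-D values apparently need to be reshaped into a column. Here's my best guess at the usage (the no-reuse pass at the end is my own greedy addition, not part of the kd-tree API):

import numpy as np
from scipy.spatial import cKDTree

# build the tree over the y values, reshaped to n points in 1 dimension
kdtree = cKDTree(unsorted_y.reshape(-1, 1))

# query the k nearest neighbours of every x value; this returns
# distances and indices into the array the tree was built from
k = len(unsorted_y)
dists, idxs = kdtree.query(unsorted_x.reshape(-1, 1), k=k)

# greedy no-reuse pass: walk each x's neighbours from nearest to
# farthest and take the first y element not already paired
used = set()
pairs = []
for i in range(len(unsorted_x)):
    for j in idxs[i]:
        j = int(j)
        if j not in used:
            used.add(j)
            pairs.append((i, j))
            break

Since y is usually much larger than x, I imagine a small fixed k (say 10) would almost always be enough, retrying with a larger k on the rare collision, which should keep the query cheap.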