I have a performance issue while "filtering" an array according to the closest float found in another array.
This is a MWE
of the problem:
import numpy as np
def random_data(N):
# Generate some random data.
return np.random.uniform(0., 10., N).tolist()
# Data lists.
N1 = 1500
list1 = [random_data(N1), random_data(N1), random_data(N1)]
list2 = random_data(1000)
# Define list1's range.
min_1, max_1 = min(list1[2]), max(list1[2])
# This list will contain the "filtered" list1.
list4 = [[], [], []]
# Go through each element in list2.
for elem2 in list2:
# If it is located within the list1 range.
if min_1 <= elem2 <= max_1:
# Find the closest float in sub-list list1[2] to this float
# in list2.
indx, elem1 = min(enumerate(list1[2]), key=lambda x:abs(x[1]-elem2))
# Store the values in list1 that are associated with the closest float
# found above.
list4[0].append(list1[0][indx])
list4[1].append(list1[1][indx])
list4[2].append(elem1)
(note that list2
contains fewer elements than list1[2]
, which is the sub-list I compare it to)
This block works as expected but it is terribly inefficient. I'm certain that the answer lies in the correct application of broadcasting and numpy
arrays, but I still haven't managed to get the hang of the former enough to apply it to my problem.
Since I'm after enhancing the performance of this code any solution will do (ie: I'm not bound by an answer making use of broadcasting necessarily)
Add
As a reference, in this similar question I made some time ago Fast weighted euclidean distance between points in arrays, user ali_m used broadcasting to achieve an amazing improvement in performance.
The question is not exactly the same (euclidean distance there instead of absolute value and also the distances in that question had to be weighted) but this question seems to me even simpler that that one.
Couldn't the broadcasting solution ali_m applied to that issue be applied to this one?
Add 2
The answer given by user2357112 with the correction by Eelco Hoogendoorn works very good for my initially defined code. I just realized that I over-simplified it, in my actual code the lists list1[2]
and list2
are not necessarily defined in the same range. This would be a more accurate representation of that (this should replace the first lines in the MWE
above):
def random_data(N, xi, xf):
# Generate some random data.
return np.random.uniform(xi, xf, N).tolist()
# Data lists.
N1 = 1500
list1 = [random_data(N1, 13., 20.), random_data(N1, -1., 4.), random_data(N1, 2., 7.)]
list2 = random_data(1000, 0., 10.)
Now the range of list1[2]
is not equal to the range for list2
and thus the answer given fails to reject those points i
for which list2[i] > max(list1[2])
or list2[i] < min(list1[2])
.
Could the answer be modified to have consider this possibility? I'm very sorry for changing the original code like this, it truly slipped by me.