0

I have two lists(or columns of dataframe - doesn't matter) of numerical values:

L1 = [1, 0.5, 3, 7, 4.7]
L2 = [2, 0.4, 8, 0.3, 5]

for example. I need to correlate these lists and find pairs(indexOfElementFromL1, indexOfElementFromL2) which corresponds min diff of two elements. For example, for my example it should be: (0,1), (1,1), (2,0), (3,2), (4,4). Really what I want - find closest element L2 to each one of L1. Of course I could naive approch like:

for el1 in L1:
  for el2 in L2:
    ....

But I'd to like to see more correct solution

Roman Kazmin
  • 931
  • 6
  • 18

4 Answers4

4

You can do it really fast by converting your lists to numpy arrays and using numpy broadcasting:

a1 = np.array(L1)
a2 = np.array(L2)

indexes = abs(a1[:, None] - a2).argmin(axis=1)
out = list(enumerate(indexes))

Output:

>>> indexes
array([1, 1, 0, 2, 4])

>>> out
[(0, 1), (1, 1), (2, 0), (3, 2), (4, 4)]
  • I think there is no good answer to this question because lists are not sorted. But for all +1 – Corralien Feb 09 '22 at 20:37
  • @Corralien I disagree - this is, in fact, the best solution. The arrays don't need to be sorted for this to work - it uses numpy broadcasting and the wonderful `argmin` (synonymous with `idxmin` in Pandas) to efficiently compute what the OP wants. –  Feb 09 '22 at 20:39
  • Do you think numpy (fortran, C) doesn't iterate over the two lists? Maybe all of you have the fastest solution for a CPU but you have the same algorithm. Loop over 2 lists... – Corralien Feb 09 '22 at 20:44
  • Admittedly I'm not sure about numpy internals, but I don't believe it's looping over the lists (I think that's the idea behind broadcasting). It's the same algorithm as the others' solution but more vectorized (as much as possible). –  Feb 09 '22 at 20:45
  • I wrote that because the OP said he *could use naive approach* but the question is what is a good approach. – Corralien Feb 09 '22 at 20:46
  • Broadcasting is just a view of your numpy array in memory. How can you guess the closest value of first item of L1 is not the last item of L2 without iterate over all elements of L2? – Corralien Feb 09 '22 at 20:47
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241881/discussion-between-richardec-and-corralien). –  Feb 09 '22 at 20:50
1

I don't see how an answer could be 'more correct', but this is how I'd do it. I like it more than a cramped one liner because it's still readable.

L1 = [1, 0.5, 3, 7, 4.7]
L2 = [2, 0.4, 8, 0.3, 5]


def find_closest(x):
    l2_diffs = [abs(x - y) for y in L2]
    return l2_diffs.index(min(l2_diffs))


res = [(i, find_closest(x)) for i, x in enumerate(L1)]

print(res)

Output:

[(0, 1), (1, 1), (2, 0), (3, 2), (4, 4)]
Tobi208
  • 1,306
  • 10
  • 17
1

taken from another answer; though I do not know how to flag for such:

Find nearest value in numpy array

import numpy as np

L1 = [1, 0.5, 3, 7, 4.7]
L2 = [2, 0.4, 8, 0.3, 5]

def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

for el1 in L1:
     print(find_nearest(L2, el1))
Shawn Ramirez
  • 796
  • 1
  • 5
  • 10
1

You can use brute force methods using for loops or taking advantage of broadcasting, as you can see in other answers.

The problem comes when the lists are large. If that's the case, for loops will be slow and broadcasting will take a lot of memory.

The alternative is to use a search algorithm like FLANN:

>>> import numpy as np
>>> from pyflann import FLANN

>>> L1 = np.array([[1, 0.5, 3, 7, 4.7]]).T      # Data points as rows
>>> L2 = np.array([[2, 0.4, 8, 0.3, 5]]).T

>>> idx, _ = FLANN().nn(pts=L2, qpts=L1)
>>> idx
array([1, 1, 0, 2, 4], dtype=int32)

>>> list(enumerate(idx))
[(0, 1), (1, 1), (2, 0), (3, 2), (4, 4)]

Or, if you want to perform multiple queries:

>>> flann = FLANN()
>>> flann.build_index(pts=L2)           # Build search index
>>> idx, _ = flann.nn_index(qpts=L1)    # Query
aerobiomat
  • 2,883
  • 1
  • 16
  • 19