
I have 2 lists of tuples list1 = [(1.332, 3.23344, 3.22), (2.122, 2.11, 2.33), ... (1, 2, 3)] and list2 = [(4.23, 12.2, 3.333), (1.234, 3.21, 4.342), ... (1.1, 2.2, 3.3)]. These lists are both very long, somewhere in the millions for both lists. For context, each of these data points is some measure of position in two different datasets. Now I want to correspond each entry in list1 to an entry in list2 if it is "close enough". By close enough I mean the distance between the positions is less than some threshold value (say .1 for example). My initial thought was using the min function on each entry in list1. That is, the following:

import numpy as np
import random

def dist(pt1, pt2): 
    return np.sqrt( ((pt2[0] - pt1[0]) ** 2) + ((pt2[1] - pt1[1]) ** 2) + ((pt2[2] - pt1[2]) ** 2) ) 

list1 = [(random.random(), random.random(), random.random()) for _ in range(25)]                                                                                              

list2 = [(random.random(), random.random(), random.random()) for _ in range(20)]   

threshold = .5
linker = []
for i, entry in enumerate(list1): 
    m = min(list2, key=lambda x: dist(entry, x)) 
    if dist(entry, m) < threshold: 
        linker.append((i, list2.index(m)))

So this would link each index in list1 to an index in list2. But I feel like there must be some already developed algorithm for this task specifically which is much faster, is there?

kauii8
3 Answers


For each point in one dataset, you're finding the nearest neighbor in a second dataset.

  1. Your posted approach has complexity O(N^2), since every entry in list1 is compared against every entry in list2.
  2. With N ~ 1 million, this becomes untenable (~10^12 distance computations).

For large datasets, tree-based nearest-neighbor searches are much better: building the tree costs O(N*log(N)) and each query costs O(log(N)).

Two popular ones in Python are the KDTree and the BallTree.

An example of solving this with BallTree:

sklearn BallTree doc

import numpy as np
from sklearn.neighbors import BallTree

# Generate Dataset 1 (random positions in 3D)
rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))  # 10 points in 3 dimensions

# Build a nearest-neighbor tree on dataset 1
# to answer nearest-neighbor queries
tree = BallTree(X, leaf_size=2)

# Generate Dataset 2 (random positions in 3D)
Y = rng.random_sample((10, 3))

# For each point in Dataset 2
# find the index and distance to the closest 
# point in Dataset 1 (using the nearest neighbor tree
# for dataset 1)
dist, ind = tree.query(Y, k=1)  # nearest neighbor  

# Results
for i, (j, d) in enumerate(zip(ind, dist)):
    print(f'Y index {i}, closest index X is {j}, dist {d}')

Output

Y index 0, closest index X is [3], dist [0.14046915]
Y index 1, closest index X is [1], dist [0.40653272]
Y index 2, closest index X is [7], dist [0.29291477]
Y index 3, closest index X is [1], dist [0.25785655]
Y index 4, closest index X is [1], dist [0.39477652]
Y index 5, closest index X is [9], dist [0.50373484]
Y index 6, closest index X is [1], dist [0.24894356]
Y index 7, closest index X is [4], dist [0.14716665]
Y index 8, closest index X is [5], dist [0.25875381]
Y index 9, closest index X is [8], dist [0.24204497]
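The question also asks for a threshold: a pair should only be linked if the nearest neighbor is actually close enough. Since `query` returns the distances alongside the indices, that is a simple filter on top of the example above. A sketch (the 0.3 threshold below is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.random_sample((10, 3))   # dataset 1
Y = rng.random_sample((10, 3))   # dataset 2

tree = BallTree(X, leaf_size=2)
dist, ind = tree.query(Y, k=1)   # both arrays have shape (10, 1)

# Keep only pairs whose nearest-neighbor distance is below the threshold
threshold = 0.3
mask = dist[:, 0] < threshold
linker = list(zip(np.nonzero(mask)[0].tolist(), ind[mask, 0].tolist()))
```

`linker` then holds `(index in Y, index in X)` pairs, analogous to the `linker` list in the question.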
DarrylG

Yes, this is absolutely a time-consuming way to do it, because first, plain Python is not optimized for these calculations (data types, interpreted loops, etc.), and second, these calculations need optimization in any language. You should use a library for manipulating matrices, such as numpy or pandas. In your case I recommend this solution: first, convert your data to a pandas DataFrame as in this post: List of Tuples to DataFrame Conversion. Second, after that conversion the distance calculation becomes routine and easy, for example: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pow.html

pandas uses numpy under the hood, and numpy is optimized for these calculations.
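As a sketch of the vectorized approach this answer suggests (using plain NumPy rather than pandas, and processing list1 in chunks because a full N×M distance matrix would not fit in memory for millions of points; the function name, chunk size, and threshold are made up for illustration, and note this is still O(N*M) work overall, just with a much smaller constant than a Python loop):

```python
import numpy as np

def link_by_threshold(list1, list2, threshold, chunk=1000):
    """For each point in list1, find the index of the nearest point
    in list2 and keep the pair if the distance is below threshold."""
    a = np.asarray(list1)          # shape (N, 3)
    b = np.asarray(list2)          # shape (M, 3)
    linker = []
    for start in range(0, len(a), chunk):
        block = a[start:start + chunk]                       # (c, 3)
        # (c, M) matrix of pairwise Euclidean distances via broadcasting
        d = np.linalg.norm(block[:, None, :] - b[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                           # (c,)
        dmin = d[np.arange(len(block)), nearest]
        for i, (j, dd) in enumerate(zip(nearest, dmin)):
            if dd < threshold:
                linker.append((start + i, int(j)))
    return linker

pairs = link_by_threshold([(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
                          [(0.05, 0.0, 0.0), (9.0, 9.0, 9.0)],
                          threshold=0.1)
# → [(0, 0)]: only the first point has a neighbor within 0.1
```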


A simple solution involves keeping a 3D array of cells to group your entries into, with a cell size equal to the threshold. With a cell size of 0.1, for example, (1.332, 3.23344, 3.22) would be grouped into cell (13, 32, 32). Once that data structure has been packed, you can find all the points near (1.332, 3.23344, 3.22) by looking at cell (13, 32, 32) (and some subset of its 26 neighboring cells).

If you really need this to be fast, you're getting into a set of algorithms called "spatial partitioners". You might look into a thing called a "k-d tree", which is ideal for storing non-uniform distributions of points in an ultra-compact fashion (and is optimized for retrieving the points in the neighborhood of some location).
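A minimal sketch of the cell-grid idea, assuming the cell size equals the threshold so a match can only lie in a point's own cell or one of its 26 neighbors (the function name is made up for illustration):

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

def link_by_grid(list1, list2, threshold):
    """Match each point in list1 to its nearest point in list2 that
    lies within `threshold`, using a hash grid of threshold-sized cells."""
    def cell(pt):
        return tuple(floor(c / threshold) for c in pt)

    # Pack list2 into the grid: cell coordinates -> indices into list2
    grid = defaultdict(list)
    for j, pt in enumerate(list2):
        grid[cell(pt)].append(j)

    linker = []
    for i, pt in enumerate(list1):
        cx, cy, cz = cell(pt)
        best_j, best_d = None, threshold
        # Search the point's own cell and its 26 neighbors
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                d = dist(pt, list2[j])
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            linker.append((i, best_j))
    return linker

pairs = link_by_grid([(1.332, 3.23344, 3.22)],
                     [(1.30, 3.25, 3.21), (9.0, 9.0, 9.0)],
                     threshold=0.1)
# → [(0, 0)]: the first list2 point is within 0.1, the second is not
```

Each list1 point only examines the handful of list2 points in 27 nearby cells instead of all of list2, which is the same locality trick the tree structures exploit.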

Sniggerfardimungus