I have an extremely large list of coordinates in the form of a list of tuples.
data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]
The list of tuples is actually being built by a for loop with an append call, like so:
data = []
for i in source:  # where i is a tuple of the form (x, y)
    data.append(i)
Is there an approach to ensure that the Euclidean distance between all tuples stays above a certain threshold? In this example the distances between (1,1), (1,2), and (2,1) are very small, so I would like to keep only one of those three tuples, resulting in any one of these new lists of tuples:
data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
data = [(2,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
data = [(1,2),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21)]
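To make the requirement precise, this is the property I want the output to satisfy (a minimal checker; far_enough is just an illustrative name, and math.dist assumes Python 3.8+):

from itertools import combinations
import math

def far_enough(points, threshold):
    # True if every pair of points is at least `threshold` apart (Euclidean).
    return all(math.dist(a, b) >= threshold
               for a, b in combinations(points, 2))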
I have a brute-force algorithm that iterates through the list, but there should be a more elegant or quicker way to do this. Are there any other methods to speed up this operation? I am expecting lists of ~70k up to 500k tuples.
My method:
from scipy.spatial.distance import euclidean

data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]
new_data = []
while data:
    check = data.pop()
    # Keep `check` only if no remaining point lies within the threshold.
    flag = True
    for i in data:
        if euclidean(check, i) < 5:
            flag = False
            break
    if flag:
        new_data.append(check)
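For comparison, a grid-based variant of the same greedy filter might look like the sketch below (filter_by_distance is a hypothetical helper, only checked against the toy example above). With the cell size equal to the threshold, any kept point that could conflict with a candidate must sit in one of the 3x3 neighboring cells, so each point is compared against far fewer candidates:

import math

def filter_by_distance(points, threshold):
    # Greedy filter: keep a point only if it is at least `threshold` away
    # from every previously kept point.
    grid = {}  # (cell_x, cell_y) -> kept points falling in that cell
    kept = []
    t2 = threshold * threshold
    for x, y in points:
        cx, cy = math.floor(x / threshold), math.floor(y / threshold)
        ok = True
        for nx in (cx - 1, cx, cx + 1):
            for ny in (cy - 1, cy, cy + 1):
                if any((x - px) ** 2 + (y - py) ** 2 < t2
                       for px, py in grid.get((nx, ny), ())):
                    ok = False
                    break
            if not ok:
                break
        if ok:
            kept.append((x, y))
            grid.setdefault((cx, cy), []).append((x, y))
    return kept

print(filter_by_distance(data, 5))

Since kept points are pairwise at least `threshold` apart, each cell can only ever hold a handful of them, so each insertion should touch O(1) candidates regardless of how the input is clustered.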
Additional points:

- Although the list of tuples comes from some iterative function, the order of the tuples is uncertain.
- The actual number of tuples is unknown until the end of the for loop.
- I would rather avoid multiprocessing/multithreading for the speedup in this scenario.
- If necessary I can post some timings, but I don't think it's needed.
- My current solution does n(n-1)/2 distance checks, i.e. O(n^2) time with O(n) space I think, so any improvement would be welcome (see the cKDTree sketch below).
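If building the whole list first is acceptable, another direction (a sketch, not benchmarked at these sizes) is scipy.spatial.cKDTree.query_pairs, which returns all index pairs at distance <= r (note: <=, whereas my loop uses a strict <). Dropping one member of every still-conflicting pair leaves a set with all pairwise distances above the threshold:

import numpy as np
from scipy.spatial import cKDTree

data = [(1,1),(1,11),(1,21),(11,1),(21,1),(11,11),(11,21),(21,11),(21,21),(1,2),(2,1)]
tree = cKDTree(np.asarray(data, dtype=float))
close_pairs = tree.query_pairs(r=5)  # {(i, j) with i < j and dist(data[i], data[j]) <= 5}
dropped = set()
for i, j in sorted(close_pairs):
    # In sorted order every pair (h, i) with h < i has already been seen, so
    # i's status is final here; j is dropped only if it conflicts with a kept point.
    if i not in dropped and j not in dropped:
        dropped.add(j)
new_data = [p for k, p in enumerate(data) if k not in dropped]

Memory should stay manageable as long as the number of close pairs is small; on very dense input query_pairs itself could produce a huge set, in which case the grid sketch above may be the safer option.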