I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' means within whatever integer tolerance I decide. I would like to filter out all the similar points from the list, keeping just one of them.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of item to compare
to_remove = set() # keep track of items to be removed
for point in points:
    to_compare.remove(point) # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)
for point in to_remove:
    points.remove(point)
It works...
>>>points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution since my list is millions of items long.
PyPy helped a lot; it sped up the whole process about 6 times, but there is probably a more efficient way to do this in the first place, or is there not?
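One idea I have been toying with (just a rough sketch of my own, the function name dedupe_by_grid is made up) is to snap every point to a coarse grid cell of side md and keep only the first point per cell. It runs in linear time, but it is not exactly equivalent to the pairwise check above, since two points closer than md can land in neighbouring cells and both be kept:

def dedupe_by_grid(points, md):
    seen = {}
    for p in points:
        cell = (p[0] // md, p[1] // md)  # integer grid cell containing the point
        if cell not in seen:             # keep the first point that hits this cell
            seen[cell] = p
    return list(seen.values())

points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648),
          (1354358, 1619520), (54592746, 54593509)]
print(dedupe_by_grid(points, 10))

On the five-point example above this keeps four points instead of three, because (54592746, 54593509) falls in a neighbouring cell, so I am not sure it is the right direction.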
Any help is very welcome.
=======
UPDATE
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920). David's code does the job in 14 seconds, but misses something: 96431 points are left. Martin's code takes only 0.06 seconds, but also misses something: 96462 points are left.
Any clue about why the results are not the same?
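In case it helps to narrow it down, a small helper like this (diff_results is a made-up name; pass it any two of the three result lists) would show exactly which points the faster versions keep that my code removes:

def diff_results(a, b):
    # returns (points only in b, points only in a) as sorted lists
    a, b = set(a), set(b)
    return sorted(b - a), sorted(a - b)

# e.g. diff_results(my_result, davids_result)  -- placeholder variable names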