I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' means within whatever integer tolerance I decide. I would like to filter out all the similar points from the list, keeping just one of them.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of item to compare
to_remove = set() # keep track of items to be removed
for point in points:
    to_compare.remove(point) # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)
for point in to_remove:
    points.remove(point)
It works...
>>>points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution since my list is millions of items long.
PyPy helped a lot; it sped up the whole process about 6 times, but there is probably a more efficient way to do this in the first place, or is there not?
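One idea I have been toying with (just a rough sketch of my own, the function name dedupe_by_grid is made up) is to snap every point to a coarse grid cell of side md and keep only the first point per cell. It runs in linear time, but it is not exactly equivalent to the pairwise check above, since two points closer than md can land in neighbouring cells and both be kept:

def dedupe_by_grid(points, md):
    seen = {}
    for p in points:
        cell = (p[0] // md, p[1] // md)  # integer grid cell containing the point
        if cell not in seen:             # keep the first point that hits this cell
            seen[cell] = p
    return list(seen.values())

points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648),
          (1354358, 1619520), (54592746, 54593509)]
print(dedupe_by_grid(points, 10))

On the five-point example above this keeps four points instead of three, because (54592746, 54593509) falls in a neighbouring cell, so I am not sure it is the right direction.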
Any help is very welcome.
=======
UPDATE
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920). David's code does the job in 14 seconds, but misses something: 96431 points are left. Martin's code takes only 0.06 seconds, but also misses something: 96462 points are left.
Any clue about why the results are not the same?
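In case it helps to narrow it down, a small helper like this (diff_results is a made-up name; pass it any two of the three result lists) would show exactly which points the faster versions keep that my code removes:

def diff_results(a, b):
    # returns (points only in b, points only in a) as sorted lists
    a, b = set(a), set(b)
    return sorted(b - a), sorted(a - b)

# e.g. diff_results(my_result, davids_result)  -- placeholder variable names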