
Most of the implementations of the algorithm to find the closest pair of points in the plane that I've seen online have one of two deficiencies: either they fail to meet an O(n log n) runtime, or they fail to accommodate the case where some points share an x-coordinate. Is a hash map (or equivalent) required to solve this problem optimally?

Roughly, the algorithm in question is (per CLRS Ch. 33.4):

  1. For an array of points P, create additional arrays X and Y such that X contains all points in P sorted by x-coordinate, and Y contains all points in P sorted by y-coordinate.
  2. Divide the points in half: drop a vertical line so that X is split into two arrays, XL and XR, and divide Y similarly, so that YL contains all points left of the line and YR contains all points right of the line, both still sorted by y-coordinate.
  3. Make recursive calls for each half, passing XL and YL to one and XR and YR to the other, and find the minimum distance d in each of those halves.
  4. Lastly, determine whether there's a pair with one point on the left and one point on the right of the dividing line at distance smaller than d (taking d to be the smaller of the two distances from step 3); by a geometric argument, it's enough to check each point within distance d of the dividing line against only the next 7 points in y-order, so the recombination of the divided subproblems is only an O(n) step (even though it looks O(n^2) at first glance); a sketch is below.
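
Roughly, assuming a strip array that already holds the points within distance d of the dividing line, sorted by y-coordinate (names are mine, just to illustrate):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Step 4: each point in the y-sorted strip only needs to be compared
// against the next 7 points, so this loop is O(n) despite its shape.
double stripClosest(const std::vector<Point> &strip, double d) {
    double best = d;
    for (std::size_t i = 0; i < strip.size(); ++i)
        for (std::size_t j = i + 1; j < strip.size() && j <= i + 7; ++j)
            best = std::min(best, dist(strip[i], strip[j]));
    return best;
}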

This has some tricky edge cases. One way people deal with them is by sorting the strip of points within distance d of the dividing line at every recombination step (e.g. here), but this is known to result in an O(n log^2 n) solution, as sketched below.
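
Concretely, those versions do something like the following at every recombination step, which is where the extra log factor comes from (a sketch; names are mine):

#include <algorithm>
#include <vector>

struct Point { double x, y; };

// Re-sorting the strip by y-coordinate inside every recursive call: an
// O(k log k) sort at each node of the recursion tree, which sums to
// O(n log n) per level over O(log n) levels, hence O(n log^2 n) overall.
void sortStripByY(std::vector<Point> &strip) {
    std::sort(strip.begin(), strip.end(),
              [](const Point &a, const Point &b) { return a.y < b.y; });
}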

Another way people deal with edge cases is by assuming each point has a distinct x-coordinate (e.g. here): note the snippet in closestUtil which adds a point in Y to Pyl (or YL, as we call it) if its x-coordinate is <= the line, or to Pyr (YR) otherwise. Note that if all points lie on the same vertical line, this results in writing past the end of an array in C++, since all n points get written to YL.

So the tricky bit when points can have the same x-coordinate is dividing the points in Y into YL and YR depending on whether a point p in Y is in XL or XR. The pseudocode in CLRS for this is (edited slightly for brevity):

for i = 1 to Y.length
    if Y[i] in X_L
        Y_L.length = Y_L.length + 1
        Y_L[Y_L.length] = Y[i]
    else
        Y_R.length = Y_R.length + 1
        Y_R[Y_R.length] = Y[i]

However, outside of pseudocode, if we're working with plain arrays, we don't have a magic function that can determine whether Y[i] is in X_L in O(1) time. If we're assured that all x-coordinates are distinct, sure: anything with an x-coordinate less than the dividing line is in XL, so with one comparison we know which array any point p in Y belongs in. But when x-coordinates are not necessarily distinct (e.g. when they all lie on the same vertical line), do we require a hash map to determine whether a point in Y is in XL or XR and break Y down into YL and YR in O(n) time? Or is there another strategy?
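
To be concrete, the hash-based approach I'm imagining is something like this (just a sketch; the hash itself is hand-waved, and exactly duplicated points would need extra care):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

struct Point { double x, y; };

struct PointHash {
    std::size_t operator()(const Point &p) const {
        // Combine the coordinate hashes (sketch only).
        return std::hash<double>{}(p.x) ^ (std::hash<double>{}(p.y) << 1);
    }
};

struct PointEq {
    bool operator()(const Point &a, const Point &b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// Partition Y into YL/YR by checking membership in XL via a hash set;
// building the set and scanning Y are both O(n) in expectation.
void splitY(const std::vector<Point> &Y, const std::vector<Point> &XL,
            std::vector<Point> &YL, std::vector<Point> &YR) {
    std::unordered_set<Point, PointHash, PointEq> inLeft(XL.begin(), XL.end());
    for (const Point &p : Y)
        (inLeft.count(p) ? YL : YR).push_back(p);
}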

jmath
  • It is fairly simple to rotate all points around (0,0) to guarantee that no pair has the same X or the same Y. This step should take N log N time by itself. Not sure why you are so against using a hash map? – Bing Wang May 07 '22 at 03:04
  • @BingWang not against it so much as just wondering if I was missing a strategy that used more basic data structures - like I said, I couldn't find a solution online that matched my two criteria, which seemed weird considering how famous this algorithm is, so I thought I might be missing something simple. – jmath May 07 '22 at 03:54

2 Answers


Yes, there are at least two approaches that work here.

The first, as Bing Wang suggests, is to apply a rotation. If the angle is sufficiently small, this amounts to breaking ties by y-coordinate after comparing by x-coordinate; no other math is needed.
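
In code, that tie-breaking comparison is just the following (a sketch; the struct and function names are only illustrative):

#include <algorithm>
#include <vector>

struct Point { double x, y; };

// Sorting by (x, y) lexicographically is equivalent to sorting by x after
// a sufficiently small rotation, so no two distinct points ever tie.
bool lessX(const Point &a, const Point &b) {
    return a.x != b.x ? a.x < b.x : a.y < b.y;
}

// e.g. to build X: std::sort(X.begin(), X.end(), lessX);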

The second is to adjust the algorithm on G4G to use a linear-time partitioning algorithm to divide the instance, and a linear-time sorted merge to conquer it. Presumably this was not done because, in most programming languages, calling a library sort is simpler than implementing those two routines.
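
For the conquer part, the merge is the standard one from merge sort (a sketch; names are only illustrative):

#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Merge the two y-sorted halves returned by the recursive calls into one
// y-sorted array in O(n), so the strip never needs to be re-sorted.
std::vector<Point> mergeByY(const std::vector<Point> &L,
                            const std::vector<Point> &R) {
    std::vector<Point> out;
    out.reserve(L.size() + R.size());
    std::size_t i = 0, j = 0;
    while (i < L.size() && j < R.size())
        out.push_back(L[i].y <= R[j].y ? L[i++] : R[j++]);
    while (i < L.size()) out.push_back(L[i++]);
    while (j < R.size()) out.push_back(R[j++]);
    return out;
}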

David Eisenstat
  • Making sure I understand the scheme you've proposed: in the current G4G algorithm, they use O(1) time to divide the input array P into a left and a right half, and O(n log n) time within each recombination step to sort the strip of points by y-coordinate. Instead, you're suggesting we do two pieces of O(n) work: first, use O(n) time to partition the input P into subarrays that will come back from our closestUtil calls sorted by y-coordinate, so that we can then merge them together in O(n) time and use the result as our set Y. Is that right? – jmath May 07 '22 at 14:02
  • @jmath you got it! – David Eisenstat May 07 '22 at 14:48

Tardos & Kleinberg suggest annotating each point with its position (index) in X. You could do this in O(N) time, or, if you really, really want to, you could do it "for free" in the sorting operation.

With this annotation, you could do your O(1) partitioning and then take the position pr of the right-most point in Xl in O(1), using it to determine whether a point in Y goes in Yl (position <= pr) or Yr (position > pr). This does not require an extra data structure like a hash map, but it does require that the same positions are used in X and Y.
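
A sketch of both the annotation and the partition (field and function names are only illustrative):

#include <cstddef>
#include <vector>

// Each point remembers its own index in the x-sorted array X. After
// sorting X by x-coordinate, the annotation is one pass:
//   for (std::size_t k = 0; k < X.size(); ++k) X[k].xIndex = k;
struct Point { double x, y; std::size_t xIndex; };

// Partition Y using only the stored indices: anything whose position in X
// is <= pr (the position of the right-most point of Xl) belongs in Yl.
// One pass, O(n), no hash map needed.
void splitY(const std::vector<Point> &Y, std::size_t pr,
            std::vector<Point> &YL, std::vector<Point> &YR) {
    for (const Point &p : Y)
        (p.xIndex <= pr ? YL : YR).push_back(p);
}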

NB: It is not immediately obvious to me that the partitioning of Y is the only problem that arises when multiple points have the same coordinate on the x-axis. It seems to me that the proof that only a linear number of comparisons is necessary across the partition breaks down, but I have seen only the proof that 15 comparisons suffice, not the proof for the stricter 7-point version, so I cannot be sure.

Bladt