
Most of the implementations of the algorithm to find the closest pair of points in the plane that I've seen online have one of two deficiencies: either they fail to meet an O(n log n) runtime, or they fail to accommodate the case where some points share an x-coordinate. Is a hash map (or equivalent) required to solve this problem optimally?

Roughly, the algorithm in question is (per CLRS Ch. 33.4):

  1. For an array of points P, create additional arrays X and Y such that X contains all points in P sorted by x-coordinate, and Y contains all points in P sorted by y-coordinate.
  2. Divide the points in half: drop a vertical line so that X is split into two arrays, XL and XR, and divide Y similarly, so that YL contains all points left of the line and YR contains all points right of the line, both still sorted by y-coordinate.
  3. Make recursive calls for each half, passing XL and YL to one and XR and YR to the other, and find the minimum distance d in each of those halves.
  4. Lastly, determine whether there's a pair with one point on the left and one point on the right of the dividing line at distance smaller than d (taking d to be the smaller of the two distances from step 3); by a geometric argument, it's enough to check each point within distance d of the dividing line against only the next 7 points in y-order, so the recombination of the divided subproblems is only an O(n) step (even though it looks O(n^2) at first glance); a sketch is below.
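
Roughly, assuming a strip array that already holds the points within distance d of the dividing line, sorted by y-coordinate (names are mine, just to illustrate):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Step 4: each point in the y-sorted strip only needs to be compared
// against the next 7 points, so this loop is O(n) despite its shape.
double stripClosest(const std::vector<Point> &strip, double d) {
    double best = d;
    for (std::size_t i = 0; i < strip.size(); ++i)
        for (std::size_t j = i + 1; j < strip.size() && j <= i + 7; ++j)
            best = std::min(best, dist(strip[i], strip[j]));
    return best;
}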

This has some tricky edge cases. One way people deal with them is by sorting the strip of points within distance d of the dividing line at every recombination step (e.g. here), but this is known to result in an O(n log^2 n) solution, as sketched below.
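
Concretely, those versions do something like the following at every recombination step, which is where the extra log factor comes from (a sketch; names are mine):

#include <algorithm>
#include <vector>

struct Point { double x, y; };

// Re-sorting the strip by y-coordinate inside every recursive call: an
// O(k log k) sort at each node of the recursion tree, which sums to
// O(n log n) per level over O(log n) levels, hence O(n log^2 n) overall.
void sortStripByY(std::vector<Point> &strip) {
    std::sort(strip.begin(), strip.end(),
              [](const Point &a, const Point &b) { return a.y < b.y; });
}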

Another way people deal with edge cases is by assuming each point has a distinct x-coordinate (e.g. here): note the snippet in closestUtil which adds a point in Y to Pyl (or YL, as we call it) if its x-coordinate is <= the line, or to Pyr (YR) otherwise. Note that if all points lie on the same vertical line, this results in writing past the end of an array in C++, since all n points get written to YL.

So the tricky bit when points can have the same x-coordinate is dividing the points in Y into YL and YR depending on whether a point p in Y is in XL or XR. The pseudocode in CLRS for this is (edited slightly for brevity):

for i = 1 to Y.length
    if Y[i] in X_L
        Y_L.length = Y_L.length + 1
        Y_L[Y_L.length] = Y[i]
    else
        Y_R.length = Y_R.length + 1
        Y_R[Y_R.length] = Y[i]

However, outside of pseudocode, if we're working with plain arrays, we don't have a magic function that can determine whether Y[i] is in X_L in O(1) time. If we're assured that all x-coordinates are distinct, sure: anything with an x-coordinate less than the dividing line is in XL, so with one comparison we know which array any point p in Y belongs in. But when x-coordinates are not necessarily distinct (e.g. when they all lie on the same vertical line), do we require a hash map to determine whether a point in Y is in XL or XR and break Y down into YL and YR in O(n) time? Or is there another strategy?
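
To be concrete, the hash-based approach I'm imagining is something like this (just a sketch; the hash itself is hand-waved, and exactly duplicated points would need extra care):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

struct Point { double x, y; };

struct PointHash {
    std::size_t operator()(const Point &p) const {
        // Combine the coordinate hashes (sketch only).
        return std::hash<double>{}(p.x) ^ (std::hash<double>{}(p.y) << 1);
    }
};

struct PointEq {
    bool operator()(const Point &a, const Point &b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// Partition Y into YL/YR by checking membership in XL via a hash set;
// building the set and scanning Y are both O(n) in expectation.
void splitY(const std::vector<Point> &Y, const std::vector<Point> &XL,
            std::vector<Point> &YL, std::vector<Point> &YR) {
    std::unordered_set<Point, PointHash, PointEq> inLeft(XL.begin(), XL.end());
    for (const Point &p : Y)
        (inLeft.count(p) ? YL : YR).push_back(p);
}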

jmath
  • It is fairly simple to rotate all points around (0,0) to guarantee that no pair has the same X or the same Y. This step should take N log N time by itself. Not sure why you are so against using a hash map? – Bing Wang May 07 '22 at 03:04
  • @BingWang not against it so much as just wondering if I was missing a strategy that used more basic data structures - like I said, I couldn't find a solution online that matched my two criteria, which seemed weird considering how famous this algorithm is, so I thought I might be missing something simple. – jmath May 07 '22 at 03:54

2 Answers


Yes, there are at least two approaches that work here.

The first, as Bing Wang suggests, is to apply a rotation. If the angle is sufficiently small, this amounts to breaking ties by y-coordinate after comparing by x-coordinate; no other math is needed.
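
In code, that tie-breaking comparison is just the following (a sketch; the struct and function names are only illustrative):

#include <algorithm>
#include <vector>

struct Point { double x, y; };

// Sorting by (x, y) lexicographically is equivalent to sorting by x after
// a sufficiently small rotation, so no two distinct points ever tie.
bool lessX(const Point &a, const Point &b) {
    return a.x != b.x ? a.x < b.x : a.y < b.y;
}

// e.g. to build X: std::sort(X.begin(), X.end(), lessX);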

The second is to adjust the algorithm on G4G to use a linear-time partitioning algorithm to divide the instance, and a linear-time sorted merge to conquer it. Presumably this was not done because, in most programming languages, calling a library sort is simpler than implementing those two routines.
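
For the conquer part, the merge is the standard one from merge sort (a sketch; names are only illustrative):

#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Merge the two y-sorted halves returned by the recursive calls into one
// y-sorted array in O(n), so the strip never needs to be re-sorted.
std::vector<Point> mergeByY(const std::vector<Point> &L,
                            const std::vector<Point> &R) {
    std::vector<Point> out;
    out.reserve(L.size() + R.size());
    std::size_t i = 0, j = 0;
    while (i < L.size() && j < R.size())
        out.push_back(L[i].y <= R[j].y ? L[i++] : R[j++]);
    while (i < L.size()) out.push_back(L[i++]);
    while (j < R.size()) out.push_back(R[j++]);
    return out;
}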

David Eisenstat
  • Making sure I understand the scheme you've proposed: in the current G4G algorithm, they use O(1) time to divide the input array P into a left and a right half, and O(n log n) time within each recombination step to sort the strip of points by y-coordinate. Instead, you're suggesting we do two pieces of O(n) work: first, use O(n) time to partition the input P into subarrays that will come back from our closestUtil calls sorted by y-coordinate, so that we can then merge them together in O(n) time and use the result as our set Y. Is that right? – jmath May 07 '22 at 14:02
  • @jmath you got it! – David Eisenstat May 07 '22 at 14:48

Tardos & Kleinberg suggest annotating each point with its position (index) in X. You could do this in O(N) time, or, if you really, really want to, you could do it "for free" in the sorting operation.

With this annotation, you could do your O(1) partitioning and then take the position pr of the right-most point in Xl in O(1), using it to determine whether a point in Y goes in Yl (position <= pr) or Yr (position > pr). This does not require an extra data structure like a hash map, but it does require that the same positions are used in X and Y.
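
A sketch of both the annotation and the partition (field and function names are only illustrative):

#include <cstddef>
#include <vector>

// Each point remembers its own index in the x-sorted array X. After
// sorting X by x-coordinate, the annotation is one pass:
//   for (std::size_t k = 0; k < X.size(); ++k) X[k].xIndex = k;
struct Point { double x, y; std::size_t xIndex; };

// Partition Y using only the stored indices: anything whose position in X
// is <= pr (the position of the right-most point of Xl) belongs in Yl.
// One pass, O(n), no hash map needed.
void splitY(const std::vector<Point> &Y, std::size_t pr,
            std::vector<Point> &YL, std::vector<Point> &YR) {
    for (const Point &p : Y)
        (p.xIndex <= pr ? YL : YR).push_back(p);
}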

NB: It is not immediately obvious to me that the partitioning of Y is the only problem that arises when multiple points have the same coordinate on the x-axis. It seems to me that the proof that only a linear number of comparisons is necessary across the partition breaks down, but I have seen only the proof that 15 comparisons suffice, not the proof for the stricter 7-point version, so I cannot be sure.

Bladt