How do I find the closest point for each point in a data set while keeping precision?

Question

This is my data set: https://pastebin.com/SsuKP2eH

I'm trying to find the nearest point for all points in the data set. These points are latitude and longitude on the Earth's surface. Of course, the nearest point cannot be the same point.

I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.

I expected to get an array full of distances, but instead, I got an array full of zeroes with some numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbours, on my data set and got the same result. So, I did some debugging and reduced the range of random numbers he used, making it closer to my own data set.

import numpy as np
import scipy.spatial as spatial
import pandas as pd

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index


np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 'Longitude':np.random.uniform(0, 360,size=N)})
data = pd.DataFrame({'Latitude':np.random.uniform(-49.19,49.32,size=N), 'Longitude':np.random.uniform(-123.02, -123.23,size=N)})

result = using_kdtree(data)

I found that the resulting distances array had small values, close to 0. This makes me believe that the reason why the result array for my data set is full of zeroes is because the differences between points are very small. Somewhere, the KD Tree/nearest neighbours loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method can keep precision but it is far too slow with 7200 points to iterate through.

Find a set of suspicious points, and work out your calculation by hand for them (e.g. via debugger) — Mad Physicist, Jul 30 '21 at 03:46
K., your precision is bogus. How could you PRETEND to measure distances with a precision in the order of 0.1 nm in a geographical context? Further, when you use these _precise_ numbers to compute the _x, y, z_ coordinates (for microscopical position differences!) the precision is necessarily reduced. So, even if you think all the points are distinct, in the _x, y, z_ space many points are collapsed. I converted your coordinates to map coordinates, using plate carrée, and a zillion of points collapsed to the same positions. I understand that what I say is not what you'd like to hear, but… — gboffi, Jul 31 '21 at 09:20
Let's move on, my suggestions: ① project your points using, e.g., Plate Carrée (it's a square, 15km × 15km, PC is good enough) ② round the positions to a reasonable number of digits, that reflects ***the real accuracy*** of your measurements ③ create a new array discarding duplicates ④ save the duplicates indices in the complete array in a dictionary, indexed on the position in the pruned array ⑤ use kdtree on the pruned array, what you get is the index of the closest ***cluster*** of dat points. — gboffi, Jul 31 '21 at 09:33

score 1 · Answer 1 · edited Jul 31 '21 at 10:27

I think what's happening is that k=2 in

    distance, index = tree.query(data[['x', 'y','z']], k=2)

tells KDTree you want the closest two points to a point. So the closest is obviously the point itself with distance from itself being zero. Also if you print index you see a Nx2 array and each row starts with the number of the row. This is KDTree's way of saying well the closest point to the i-th point is the i-th point itself.

Obviously that is not useful and you probably want only the 2nd closest point. Fortunately I found this in the documentation of the k parameter of query

Either the number of nearest neighbors to return, or a list of the k-th nearest neighbors to return, starting from 1. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query

So

    distance, index = tree.query(data[['x', 'y','z']], k=[2])

gives only the distance and index of the 2nd to closest point.

How do I find the closest point for each point in a data set while keeping precision?

1 Answers1