
I have read this Q/A (knn with big sparse matrices in python) and I have a similar problem. I have a sparse array of radar data of size 125930, and the longitude and latitude arrays have identical shape. Only 5% of the data is not NULL; the rest are all NULLs.

The data lies on a sphere, so I use a VPTree with great circle distance to compute distances. The input grid spacing is irregular, and I would like to interpolate this data to a regular grid on the sphere with a spacing of 0.05 degrees in both the lat and lon directions. In the coarse (irregular) grid the spacing between two latitudes is 0.01 degrees and between two longitudes is 0.09 degrees. I create my mesh grid in the following way, which gives 12960000 grid points in total, based on the minimum and maximum latitude and longitude of the irregular grid.

import numpy as np

# Regular 0.05 degree target grid spanning the extent of the irregular grid
latGrid = np.arange(minLat, maxLat, 0.05)
lonGrid = np.arange(minLo, maxLo, 0.05)

gridLon, gridLat = np.meshgrid(lonGrid, latGrid)
grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]   # (lon, lat) pairs

import vptree

# Keep only the ~5% of cells that actually contain an observation
radar_data = radar_element[np.nonzero(radar_element)]
lat_surface = lat[np.nonzero(radar_element)]
lon_surface = lon[np.nonzero(radar_element)]

points = np.c_[lon_surface, lat_surface]
args = []
if points.size > 0:
    tree = vptree.VPTree(points, greatCircleDistance)
    for grid_point in grid_points:
        indices = tree.get_all_in_range(grid_point, 4.3)
        args.append(indices)

The problem is the query

get_all_in_range

It currently takes 12 minutes to run for every pass of the above data, and with a total of 175 passes that comes to 35 hours, which is unacceptable. Is there any way to reduce the number of grid points sent to the query (based on some similarity), since the bulk of the indices returned are null? I have also used Scikit-learn's BallTree and its performance is even worse. I am not sure whether FLANN is an appropriate fit for my problem.

gansub
  • Is the data completely unsorted? How long would it take to sort it? Rather than a finer grid, I'd make a coarse grid for pre-selecting data. Then use the VPTree on the pre-selected data, providing the original coordinates, of course. You could iterate the process of pre-selecting data and refine the grid. If properly implemented, this would already be a sufficient algorithm with no need for the VPTree. – mikuszefski Aug 24 '18 at 06:49
  • ...if the data is in theta, phi, the sphere is some sort of problem. I'd probably split the work in two parts: first the data around the equator, then rotate by 90 degrees to put the poles onto the equator (taking into account the overlap, especially on the axis of rotation). – mikuszefski Aug 24 '18 at 06:56
  • If you say "nearest neighbor", do you mean the points within a certain given distance? – mikuszefski Aug 24 '18 at 11:04
  • Did you try the annoy solution suggested below? – mikuszefski Aug 24 '18 at 11:06
  • @mikuszefski - yes, I mean points within a certain given distance. The issue with Annoy is that I have to convert to Cartesian and then reconvert back to lat/lon, and I have small grid spacing where curvature is important, so I am hesitant to try it. I will try it as a last resort, of course. – gansub Aug 24 '18 at 11:10
  • @mikuszefski - one thing I would like to explore is converting get_all_in_range from serial to parallel, in other words using threads. Right now I do serial nearest neighbor searches (a sketch of this idea follows the comment thread). – gansub Aug 24 '18 at 11:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178690/discussion-between-gansub-and-mikuszefski). – gansub Aug 24 '18 at 14:24
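Picking up the parallelization idea from the last comments, here is a minimal sketch (not from the thread itself) that farms the range queries out to worker processes; processes rather than threads, since the pure-Python distance function is CPU-bound and threads would contend on the GIL. It assumes a fork-based platform so the workers inherit the tree, a haversine stand-in for the question's greatCircleDistance, and the points, grid_points, and 4.3 radius from the question.

import math
import multiprocessing as mp

import vptree

def greatCircleDistance(p1, p2):
    # stand-in haversine distance in km between two (lon, lat) points in degrees
    lon1, lat1, lon2, lat2 = map(math.radians, (p1[0], p1[1], p2[0], p2[1]))
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

tree = None  # built once in the parent, inherited by forked workers

def query_range(grid_point):
    # everything within 4.3 (same units as greatCircleDistance) of one grid point
    return tree.get_all_in_range(grid_point, 4.3)

if __name__ == "__main__":
    tree = vptree.VPTree(points, greatCircleDistance)   # points as built in the question
    with mp.Pool(processes=mp.cpu_count()) as pool:
        # a large chunksize keeps inter-process overhead low for millions of queries
        args = pool.map(query_range, grid_points, chunksize=1000)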

3 Answers


I would just convert to 3D coordinates and use Euclidean distance.

You can use something like Annoy (disclosure: I'm the author).

Example from something I built: https://github.com/erikbern/ping/blob/master/plot.py
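For concreteness, here is a hedged sketch of that suggestion using the annoy package, assuming the 4.3 search radius from the question is in kilometres and that 50 candidate neighbours per grid point are enough: project lon/lat onto the unit sphere, index the 3D points with Euclidean distance, then keep neighbours whose chord length corresponds to the great-circle radius.

import numpy as np
from annoy import AnnoyIndex

def to_xyz(lon_deg, lat_deg):
    # unit-sphere Cartesian coordinates from lon/lat in degrees
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)]

index = AnnoyIndex(3, 'euclidean')
for i, (lon, lat) in enumerate(points):            # points as in the question
    index.add_item(i, to_xyz(lon, lat))
index.build(10)                                    # 10 trees; more trees = better recall

arc = 4.3 / 6371.0                                 # assumed 4.3 km radius -> radians on the unit sphere
chord = 2.0 * np.sin(arc / 2.0)                    # Euclidean chord equivalent to that arc
for lon, lat in grid_points:
    ids, dists = index.get_nns_by_vector(to_xyz(lon, lat), 50, include_distances=True)
    nearby = [j for j, d in zip(ids, dists) if d <= chord]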


I would first put your radar observations into a spatial index as lat/lon. For the sake of Python, let's use R-Tree. I would follow this concept:

http://toblerity.org/rtree/tutorial.html#using-rtree-as-a-cheapo-spatial-database

Load your radar observations:

from rtree import index

idx = index.Index()
for i, (y, x, m) in enumerate(observations):    # observations are (lat, lon, measurement) tuples
    idx.insert(i, (x, y, x, y), obj=(y, x, m))  # a point is stored as a degenerate box

Then, for your desired Great Circle distance, I would calculate a "safe" Euclidean distance (in degrees) to use for selecting candidate points.
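As a rough illustration (my assumption, not part of the answer): if the great-circle radius is given in kilometres and the data stays away from the poles, the degree padding can be bounded using the fact that one degree of latitude is about 111 km while one degree of longitude shrinks with cos(latitude).

import numpy as np

def safe_padding_deg(radius_km, max_abs_lat_deg):
    # degree padding guaranteed to cover radius_km anywhere in the domain
    deg_lat = radius_km / 111.0
    deg_lon = radius_km / (111.0 * np.cos(np.radians(max_abs_lat_deg)))
    return max(deg_lat, deg_lon)

safe_distance = safe_padding_deg(4.3, 60.0)   # 4.3 km radius and 60 degrees max latitude are both assumed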

You can query the R-Tree for candidate points near the (x,y) of your output grid points:

candidates = idx.intersection((x - safe_distance, y - safe_distance,
                               x + safe_distance, y + safe_distance), objects=True)

This will give you the candidate points as [(y, x, m), ...]; with objects=True each returned item carries its (y, x, m) tuple in its .object attribute.

Now filter the candidates using a Great Circle calculation. You can then do the interpolation with the remaining point objects.
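A minimal sketch of this filtering step, assuming the candidates come from the idx.intersection(...) call above, a haversine distance in kilometres, and the 4.3 km radius carried over from the question:

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km between two points given in degrees
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# keep only the stored (y, x, m) tuples whose great-circle distance to the grid point (x, y) is in range
nearby = [item.object for item in candidates
          if haversine_km(y, x, item.object[0], item.object[1]) <= 4.3]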

Marc Pfister

Here's another strategy that approaches the problem in the opposite direction. I think this is a better approach for several reasons:

  • The radar observation dataset is sparse, so to run calculations on every output point seems wasteful, even with the help of a spatial index.
  • The output grid has regular spacing so it can be easily calculated.
  • So it would be less work to look at every observation point, calculate which output points are nearby, and use that information to build a list of output points and the observation points they are close to.

The observation data is in the form (X,Y,M) (longitude, latitude, measurement).

The output is a grid with regular spacing, e.g. every 0.1 degrees.

First create a dictionary for your output grid points that are near to observations:

output = {}

Then take an observation point and find the output grid points nearby that are within the Great Circle distance (GCD). Start checking at a nearby output point and iterate your way outwards by row/column until you have found all the possible output points within your GCD.

This will give you a list of grid points that are within the GCD of X and Y. Something like:

get_points(X,Y) ----> [[x1, y1], [x2,y2]....]
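One hypothetical way to realize get_points, shown only as a sketch: instead of walking outwards cell by cell, bound the search with a degree padding and then keep the grid nodes that pass the great-circle test. The grid origin, the 0.05 degree step, and the 4.3 km radius are assumptions, not values given in this answer.

import math

GRID_MIN_LON, GRID_MIN_LAT, STEP = 60.0, 5.0, 0.05   # assumed grid origin and spacing
GCD_KM = 4.3                                         # assumed search radius in km

def gc_km(lat1, lon1, lat2, lon2):
    # great-circle (haversine) distance in km, inputs in degrees
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def get_points(X, Y):
    # return [[x, y], ...] output grid nodes within GCD_KM of observation (X, Y)
    pad = GCD_KM / (111.0 * math.cos(math.radians(Y)))   # worst-case padding in degrees
    c0 = int(math.floor((X - pad - GRID_MIN_LON) / STEP))
    c1 = int(math.ceil((X + pad - GRID_MIN_LON) / STEP))
    r0 = int(math.floor((Y - pad - GRID_MIN_LAT) / STEP))
    r1 = int(math.ceil((Y + pad - GRID_MIN_LAT) / STEP))
    nodes = []
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            x, y = GRID_MIN_LON + c * STEP, GRID_MIN_LAT + r * STEP
            if gc_km(Y, X, y, x) <= GCD_KM:
                nodes.append([x, y])
    return nodes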

Now we'll flip this around. We want to store each output point and a list of the observation points that are near it. To store the point in the output dictionary we need some kind of unique key. A geohash (which interleaves latitude and longitude and generates a unique string) is perfect for this.

For each output point (xn, yn) compute the geohash, and add an entry to the output dictionary with (xn, yn) and start (or append to) a list of observations:

import Geohash   # the Geohash package; encode() expects (latitude, longitude)

key = Geohash.encode(y, x)
if key not in output:
    output[key] = {'coords': [x, y], 'observations': [[X, Y, M]]}
else:
    output[key]['observations'].append([X, Y, M])

We store the original x,y instead of reversing the geohash and losing accuracy.

When you have run all of your observations you will then have a dictionary of all the output points that require calculation, with each point having a list of observations that are within the GCD requirement.

You can then loop through the points and calculate the output array index and interpolated value, and write that to your output array:

def get_indices(x, y):
    ''' converts an x, y to row, column in the output array '''
    ...
    return row, column

def get_weighted_value(point):
    ''' takes a point entry and performs inverse distance weighting
        using the point's coordinates and list of observations '''
    ...
    return value

for point in output.values():
    row, column = get_indices(*point['coords'])
    idw = get_weighted_value(point)
    outarray[column, row] = idw
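
For completeness, a hedged sketch of the two helpers left abstract above; the grid origin/step and the power-2 weighting are illustrative assumptions, not part of the answer.

import math

GRID_MIN_LON, GRID_MIN_LAT, STEP = 60.0, 5.0, 0.05   # assumed grid layout, matching the earlier sketch

def get_indices(x, y):
    # convert an output point's lon/lat into (row, column) of the output array
    column = int(round((x - GRID_MIN_LON) / STEP))
    row = int(round((y - GRID_MIN_LAT) / STEP))
    return row, column

def get_weighted_value(point, power=2, eps=1e-12):
    # inverse distance weighting over the observations attached to one output point
    x, y = point['coords']
    num = den = 0.0
    for X, Y, M in point['observations']:
        d = math.hypot(X - x, Y - y) + eps   # planar degrees; acceptable at this small radius (assumption)
        w = 1.0 / d ** power
        num += w * M
        den += w
    return num / den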
Marc Pfister