I have read this Q/A - knn with big sparse matrices in python - and I have a similar problem. I have a sparse array of radar data of size 125930, and the longitude and latitude arrays have the same shape. Only 5% of the data is non-NULL; the rest is NULL.
The data lie on a sphere, so I use a VPTree with great-circle distance to compute distances. The source grid spacing is irregular (0.01 degrees between two latitudes and 0.09 degrees between two longitudes in the coarse grid), and I would like to interpolate the data to a regular grid on the sphere with a spacing of 0.05 degrees in both the latitude and longitude directions. I create the mesh grid as follows, based on the minimum and maximum latitude and longitude of the irregular grid, which gives 12,960,000 grid points in total.
import numpy as np
import vptree

# regular 0.05-degree target grid, flattened to (N, 2) lon/lat pairs
latGrid = np.arange(minLat, maxLat, 0.05)
lonGrid = np.arange(minLo, maxLo, 0.05)
gridLon, gridLat = np.meshgrid(lonGrid, latGrid)
grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]

# keep only the non-NULL radar samples and their coordinates
radar_data = radar_element[np.nonzero(radar_element)]
lat_surface = lat[np.nonzero(radar_element)]
lon_surface = lon[np.nonzero(radar_element)]
points = np.c_[lon_surface, lat_surface]

args = []
if points.size > 0:
    tree = vptree.VPTree(points, greatCircleDistance)
    for grid_point in grid_points:
        indices = tree.get_all_in_range(grid_point, 4.3)
        args.append(indices)
The problem is the get_all_in_range query.
It currently takes 12 minutes per pass of the above data, and with 175 passes in total that comes to about 35 hours, which is unacceptable. Is there any way to reduce the number of grid points sent to the query (based on some similarity), since the bulk of the queries return no indices at all? I have also used Scikit-learn's BallTree and its performance was even worse than this one. I am not sure whether FLANN is appropriate for my problem.
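For reference, this is roughly what I tried with Scikit-learn's BallTree (a sketch, assuming my greatCircleDistance returns kilometres, so the 4.3 search radius is also in km; the haversine metric wants (lat, lon) pairs in radians and works with angular distances):

import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed Earth radius, used to convert km to radians

# the haversine metric expects (lat, lon) in radians
data_rad = np.radians(np.c_[lat_surface, lon_surface])
grid_rad = np.radians(np.c_[gridLat.ravel(), gridLon.ravel()])

ball_tree = BallTree(data_rad, metric='haversine')
# one vectorised radius query over all grid points; radius given in radians
indices_per_grid_point = ball_tree.query_radius(grid_rad, r=4.3 / EARTH_RADIUS_KM)

Even this was slower than the VPTree for me, which is why I am asking whether the grid points can somehow be thinned out before querying.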