
I have large 2D arrays with unsorted (X, Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with success for arrays of around 500,000 (X, Y) points. However, when I try the same algorithm on datasets of more than 1,000,000 points, query_ball_tree raises a MemoryError.

I use 64-bit Windows with 16 GB of RAM, and have tried both the 32-bit and 64-bit versions of Python and the extension modules (SciPy and NumPy).

from scipy.spatial import cKDTree

def Construct_SearchTree(AllXyPoints, Maxdist):
    # Build the k-d tree, then query it against itself for all pairs within Maxdist
    KDsearch = cKDTree(AllXyPoints)
    return KDsearch.query_ball_tree(KDsearch, Maxdist)

My questions:

1) Does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case than memory usage.

2) I hoped that switching from 32-bit to 64-bit Python and extensions would solve the MemoryError. What could be the reason that it didn't?

Thanks in advance for your help and advice.

1 Answer


I experienced a MemoryError with SciPy's cKDTree during construction and with scikit-learn's KDTree when calling .query_radius(). I found that scikit-learn's BallTree was more memory efficient, and using a BallTree solved the issue for me. I've tested BallTree with 1 million data points on my 64-bit system. It still consumes all my available memory (12 GB) and some swap space, but I don't get a MemoryError.

Queries on a BallTree will not be as fast as on a KDTree since your data is 2D, and BallTrees are slower than KDTrees when d <= 3 (see explanation here). However, given that cKDTree and scikit-learn's KDTree both raise MemoryErrors (on my system, anyway), the simplest solution is to use a BallTree.

from sklearn.neighbors import BallTree
import numpy as np

max_dist = .1
points = np.random.normal(size=(1000000, 2))  # 1 million 2D points
ball_tree = BallTree(points)

# For every point, find the indices of all points within max_dist
neighbors = ball_tree.query_radius(points, max_dist)

Depending on your Maxdist (max_dist above), the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius() returns an np.array of np.arrays rather than a list of lists, so it should save you some memory there (see this answer for an explanation).
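To make the shape of that return value concrete, here is a quick, hypothetical inspection of neighbors, continuing from the snippet above (the printed values are illustrative, since the points are random):

# `neighbors` is a 1D object array with one entry per query point; each entry is
# an integer array of indices into `points` that lie within max_dist of that point.
# Each point counts as its own neighbor here, because the query points are the tree points.
print(neighbors.shape)   # (1000000,)
print(neighbors.dtype)   # object
print(neighbors[0])      # e.g. array([0, 74512, ...]) -- indices of the neighbors of points[0]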

JaminSore
  • A Ball-tree is also not a KD-tree (and the effects of that go unmentioned, despite the answer saying it works at the given dimensionality). But I'm also recommending sklearn for these data structures. – sascha Jan 24 '18 at 15:15
  • @sascha, I've updated my answer. Apart from performance, did you have other effects in mind? – JaminSore Jan 24 '18 at 16:55
  • @JaminSore Rather than querying the nearest neighbors within a ***single list***, is it possible to use the ball tree method to compute the ***nearest-distance neighbours for coordinates between two nested lists*** `a` and `b`, where `a` has dimension `1000,2` and `b` has dimension `2000,2`? E.g. `a=[ [1.2, 1.3], [6.5, 10] ... ]` and `b= [ [11,1.2] , [9.6, 8]... ]`. I've done this using `scipy.spatial.kdtree` by doing `tree = spatial.KDTree(a)` and `closest_distance, closest_index = tree.query(b, k=1, p=2)`, but I'm not sure it's possible with `BallTree`. Is it? – Chuck Nov 04 '18 at 11:55
  • @Chuck Yes. From the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree.query) `query()` takes an *array-like* (read `list`, `numpy.array` or anything that quacks like an array) as the first argument. In my example, the `points` I passed to `query_radius()` did not have to be the same points used in construction--the same goes for `query()` – JaminSore Nov 05 '18 at 19:40
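Following up on the last two comments, here is a minimal sketch of that two-set nearest-neighbour query done with BallTree. The arrays a and b below are made-up placeholders with the shapes mentioned in the comment; the key point is that the query points do not have to be the points the tree was built on:

from sklearn.neighbors import BallTree
import numpy as np

a = np.random.rand(1000, 2)   # hypothetical reference coordinates (the tree is built on these)
b = np.random.rand(2000, 2)   # hypothetical query coordinates

tree = BallTree(a)                                     # Euclidean metric by default, like p=2 above
closest_distance, closest_index = tree.query(b, k=1)   # nearest point in `a` for each point in `b`
# closest_distance[i, 0] is the distance from b[i] to its nearest neighbor a[closest_index[i, 0]]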