I have a large set of 2-dimensional points and want to be able to rapidly query the set for the k-Nearest-Neighbours of any point in the 2-d space. Since it's low-dimensional, a KD-Tree seems like a good way to go about it. My initial data set will only be updated very rarely, so the time to query a point should be more important to me than the build-time. However, each time I run the program I will need to re-load the object, so I also need a structure that can be saved and reloaded swiftly.
The two choices readily available are the KDTree structures in SciPy and in SciKit-learn. Below I profile the two of these for build-speed and query-speed across a large range of list lengths. I also pickle the SciKit-learn structure and show the time to re-load the object from the pickle. These are compared in a graph, and the code used to generate the timings is included below.
As I show in the graph, loading from a pickle is faster than building it from scratch by half an order of magnitude for large N, showing that the KDTree is suitable for my use case (ie. frequent re-loads but infrequent re-builds).
Code to compare build-times:
# Profiling the building time for the two KD-tree structures and re-loading from a pickle
import math, timeit, pickle, sklearn.neighbors
the_lengths = [100, 1000, 10000, 100000, 1000000]
theSciPyBuildTime = []
theSklBuildTime = []
theRebuildTime = []
for length in the_lengths:
dim = 5*int(math.sqrt(length))
nTimes = 50
from random import randint
listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]
setup = """import scipy.spatial
import sklearn.neighbors
length = """ + str(length) + """
dim = """ + str(dim) + """
from random import randint
listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]"""
theSciPyBuildTime.append( timeit.timeit('scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)', setup=setup, number=nTimes)/nTimes )
theSklBuildTime.append( timeit.timeit('sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)', setup=setup, number=nTimes)/nTimes )
theTreeSkl = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20, metric='euclidean')
f = open('temp.pkl','w')
temp = pickle.dumps(theTreeSkl)
theRebuildTime.append( timeit.timeit('pickle.loads(temp)', 'from __main__ import pickle,temp', number=nTimes)/nTimes )
Code to compare query-times:
# Profiling the query time for the two KD-tree structures
import scipy.spatial, sklearn.neighbors
the_lengths = [100, 1000, 10000, 100000, 1000000, 10000000]
theSciPyQueryTime = []
theSklQueryTime = []
for length in the_lengths:
dim = 5*int(math.sqrt(length))
nTimes = 50
listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]
setup = """from __main__ import sciPiTree,sklTree
from random import randint
length = """ + str(length) + """
randPoint = [randint(0,""" + str(dim) + """),randint(0,""" + str(dim) + """)]"""
sciPiTree = scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)
sklTree = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)
theSciPyQueryTime.append( timeit.timeit('sciPiTree.query(randPoint,10)', setup=setup, number=nTimes)/nTimes )
theSklQueryTime.append( timeit.timeit('sklTree.query(randPoint,10)', setup=setup, number=nTimes)/nTimes )
Questions:
The Result: Although they're getting closer for very large N, SciKit-learn seems to beat SciPy for both build time and query time. Have other people found this?
The Maths: Are there any better structures available for this? I'm only working in a 2D space (although the data will be quite dense so brute force is out), is there a better structure for low-dimensional kNN searches?
The Speed: It looks like the build-time for the two approaches is getting closer at large N but my computer gave up on me - can anyone verify this for me for larger N?! Thanks!! Does re-build time continue to increase roughly linearly as well?
Practicalities: The SciPy KDTree won't pickle. As reported in this post, I'm given the following error "PicklingError: Can't pickle : it's not found as scipy.spatial.kdtree.innernode" - I think this is due to it being a nested structure. According to an answer reported in this post, nested structures can be pickled with dill. However, dill gives me the same error - why is this?