
I am benchmarking k-NN with sklearn. Here is my system info.

sys info

  • Intel(R) Xeon(R) L5640 (6 cores, 12 hardware threads);
  • Ubuntu 18.04, Python 3.7.3, numpy 1.16.4, sklearn 0.21.2;
  • No other jobs/tasks are occupying the CPU cores.

dataset

The benchmark runs on sklearn's MNIST-style digits dataset, which has 1797 samples, 10 classes, and 8*8 = 64 dimensions; each feature is an integer in the range 0-16 (17 possible values).

Each square in the sample image stands for one pixel, 8*8 = 64 pixels in total; each pixel value ranges from 0 to 16.

[sample image of one 8*8 digit]
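
For reference, a minimal sketch of loading the dataset and checking its shape (assuming sklearn's load_digits, which is what "sklearn MNIST" usually refers to):

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)                     # (1797, 64): 1797 samples, 8*8 = 64 features
print(digits.data.min(), digits.data.max())  # 0.0 16.0: pixel values in [0, 16]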

code

Here is the code.

snippet_1:

from sklearn.neighbors import KNeighborsClassifier

# trainData/trainLabels/testData come from a train/test split of the digits data
model = KNeighborsClassifier(n_neighbors=5, n_jobs=1, algorithm='brute')
model.fit(trainData, trainLabels)
predictions = model.predict(testData)

takes about 0.1s

snippet_2:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, n_jobs=1, algorithm='kd_tree')
model.fit(trainData, trainLabels)
predictions = model.predict(testData)

takes about 0.2s

I repeated the benchmark multiple times; no matter which one I ran first, snippet_1 was always about 2 times faster than snippet_2.
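
For what it's worth, a sketch of how the two settings could be timed head-to-head (the timeit harness is my assumption; the OP's actual timing code is not shown, and trainData/testData are the OP's variables):

from timeit import timeit
from sklearn.neighbors import KNeighborsClassifier

def bench(algorithm):
    model = KNeighborsClassifier(n_neighbors=5, n_jobs=1, algorithm=algorithm)
    model.fit(trainData, trainLabels)
    # average prediction time over 10 runs
    return timeit(lambda: model.predict(testData), number=10) / 10

print('brute:  ', bench('brute'))
print('kd_tree:', bench('kd_tree'))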

question

Why does 'kd_tree' take more time than 'brute'?

I know "curse of dimensionality", since the doc says it clearly, what I am asking is why is that?

  • How many dimensions/input points? KDTree gets slower than brute force when the dimensionality gets higher. – Daniel F Sep 23 '19 at 11:30
  • @DanielF Thanks for your reminder. I've updated the OP. Each sample has 8*8 Dimensionality. –  Sep 23 '19 at 12:07
  • what does "8*8 dimensionality and 17 features" mean? If each feature has 64 subfeatures, your number of dimensions is almost as large as your number of samples. KDTree is going to be slower in that case. – Daniel F Sep 23 '19 at 13:21
  • @DanielF I've updated the OP for "8*8 dimensionality and 17 features". –  Sep 24 '19 at 01:04
  • @DanielF Is there some mathematical deduction or explanation why kdtree gets "slower" when the dimensionality gets higher? What is the threshold? –  Sep 24 '19 at 01:05

2 Answers


The answer seems to be related to the dimensionality of your model, i.e. the curse of dimensionality. A KD-tree scales very poorly once the dimension rises above roughly 15-20 (close to exponentially), whereas brute force follows a more linear-like pattern. When run on GPUs, brute force can indeed be faster for higher dimensions. Another researcher found a similar problem here: Comparison search time between K-D tree and Brute-force
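
One way to see this scaling directly is to time both algorithms on random data of increasing dimensionality; a minimal sketch (the sample count mirrors the OP's dataset, the dimensions are arbitrary choices):

import numpy as np
from timeit import timeit
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
for dim in (2, 8, 32, 64):
    X = rng.rand(1797, dim)
    for alg in ('brute', 'kd_tree'):
        nn = NearestNeighbors(n_neighbors=5, algorithm=alg).fit(X)
        t = timeit(lambda: nn.kneighbors(X), number=3) / 3
        print('dim=%2d %-8s %.4fs' % (dim, alg, t))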

– Celius Stingher

In general, a KD-tree will be slower than brute force if N < 2**k, where k is the number of dimensions (in this case 8 * 8 = 64) and N is the number of samples. Here, 2**64 = 1.8E19 >> 1797, so the KD-tree is far slower.
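
Plugging this question's numbers into that rule of thumb:

# N < 2**k check for the digits dataset
N, k = 1797, 8 * 8
print(2 ** k)      # 18446744073709551616, i.e. about 1.8e19
print(N < 2 ** k)  # True -> expect the KD-tree to be slower than brute force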

Basically, a KDTree does binary splits of the data along each dimension as a first step. If it has enough data to do that, it can guess the closest neighbors by the number of splits in common they have. If N < 2**k, it runs out of data before it runs out of dimensions to split. It then has no distance information about the other dimensions. With no good guess, it still has to brute force the rest of the dimensions, making the KDTree unnecessary overhead.
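
To make "runs out of data" concrete, here is a rough back-of-the-envelope illustration (it ignores details like sklearn's leaf_size, so treat it as an approximation):

import numpy as np
# A balanced binary tree over N = 1797 points exhausts its samples after
# about log2(N) splits, far fewer than the 64 available dimensions.
print(np.log2(1797))  # ~10.8 splits, so most of the 64 dimensions are never split on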

A more in-depth discussion of the issues and possible solutions can be found here. For this application, the third answer, which suggests using PCA first to reduce your dimensionality, is probably your best bet.
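
A minimal sketch of that PCA-first approach (the component count of 10 is an arbitrary assumption, and trainData/testData are the OP's variables):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# hypothetical: project to 10 components before the k-NN search
model = make_pipeline(PCA(n_components=10),
                      KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree'))
model.fit(trainData, trainLabels)
predictions = model.predict(testData)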

– Daniel F