I am benchmarking k-NN (KNeighborsClassifier) with scikit-learn. Here is my system info.
sys info
- Intel(R) Xeon(R) L5640 (6 cores 12 siblings);
- Ubuntu 18.04, Python 3.7.3, numpy 1.16.4, sklearn 0.21.2;
- No other jobs/tasks are occupying the CPU cores.
dataset
The benchmark runs on the scikit-learn MNIST digits dataset, which has 1797 samples, 10 classes, and an 8*8 dimensionality (64 features per sample).
Each square in this sample image stands for one pixel, 8*8 = 64 pixels in total. Each pixel value ranges from 0 to 16.
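For reference, a minimal sketch of loading the same dataset and confirming the numbers above (this assumes the benchmark data comes from `sklearn.datasets.load_digits`; the original loading code is not shown):

```python
# Load the scikit-learn digits dataset and verify its shape and value range.
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

print(X.shape)            # (1797, 64): 1797 samples, 8*8 = 64 features
print(X.min(), X.max())   # pixel values range from 0 to 16
print(len(set(y)))        # 10 classes (digits 0-9)
```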
code
Here is the code.
snippet_1:
n_neighbors = 5; n_jobs = 1; algorithm = 'brute'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm=algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.1s
snippet_2:
n_neighbors = 5; n_jobs = 1; algorithm = 'kd_tree'
model = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=n_jobs, algorithm=algorithm)
model.fit(trainData, trainLabels)
predictions = model.predict(testData)
takes about 0.2s
I repeated the benchmark multiple times; no matter which snippet I run first, snippet_1 is always about twice as fast as snippet_2.
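For completeness, the two snippets above can be combined into one self-contained, repeatable benchmark. The 75/25 train/test split and `random_state` here are assumptions, since the original split code is not shown:

```python
# Reproducible version of the benchmark: time predict() for both algorithms.
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
# Assumed split; the original question does not show how the data was split.
trainData, testData, trainLabels, testLabels = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42)

for algorithm in ('brute', 'kd_tree'):
    model = KNeighborsClassifier(n_neighbors=5, n_jobs=1, algorithm=algorithm)
    model.fit(trainData, trainLabels)
    start = time.perf_counter()
    predictions = model.predict(testData)
    print(f"{algorithm}: {time.perf_counter() - start:.3f}s")
```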
question
Why does 'kd_tree' take more time than 'brute'?
I know about the "curse of dimensionality", since the docs state it clearly; what I am asking is why that happens.