I am trying to implement a k-nearest neighbors classifier on the MNIST dataset.
I tried to check my results by comparing them with scikit-learn's KNeighborsClassifier.
For verification, I use the first 6 samples of the training set and find the 6 nearest neighbors of the first sample.
The distances I calculate do not match the distances returned by KNeighborsClassifier.
I cannot figure out why my values are different.
I referred to this question for ways to compute the Euclidean distance.
My code:
from mlxtend.data import loadlocal_mnist
import numpy as np
from scipy.spatial import distance
train, train_label = loadlocal_mnist(
    images_path='train-images.idx3-ubyte',
    labels_path='train-labels.idx1-ubyte')
train_label = train_label.reshape(-1, 1)
train = train[:6, :]
train_label = train_label[:6, :]
# print(train_label)
test = train.copy()
test_label = train_label.copy()
test = test[:1, :]
test_label = test_label[:1, :]
for test_idx, test_row in enumerate(test):
    for train_idx, train_row in enumerate(train):
        d1 = np.linalg.norm(train_row - test_row)
        d2 = distance.euclidean(train_row, test_row)
        d3 = (((train_row - test_row)**2).sum())**0.5
        d4 = np.dot(train_row - test_row, train_row - test_row)**0.5
        print(train_idx, d1, d2, d3, d4)
The test set is just the first row of the training set.
The output for the above is:
0 0.0 0.0 0.0 0.0
1 2618.6771469579826 2618.6771469579826 140.3923074815711 15.937377450509228
2 2372.0210791643485 2372.0210791643485 134.29817571359635 10.770329614269007
3 2139.966354875702 2139.966354875702 122.37646832622684 11.313708498984761
4 2485.1432554281455 2485.1432554281455 135.5322839769182 13.892443989449804
5 2582.292392429641 2582.292392429641 144.69968901141425 14.212670403551895
And this is the KNeighborsClassifier code I compare against:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=6)
neigh.fit(train, train_label)
closest = neigh.kneighbors(test[0].reshape(1, -1))
print(closest)
Output:
(array([[ 0. , 2387.11164381, 2554.81975881, 2582.29239243,
2672.46721215, 2773.14911247]]), array([[0, 1, 3, 5, 4, 2]], dtype=int64))
I am trying to calculate the Euclidean distance between data points to find the nearest neighbors. d1, d2, d3, and d4 are four different approaches I found in the question linked above, and the output shows the value each one produces.
However, the distances returned by KNeighborsClassifier are different from all of these, even though, according to the documentation, it also uses Euclidean distance. Why is that happening?
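One check that may help narrow this down is whether the input dtype matters. The sketch below is only a diagnostic under assumptions: it assumes the arrays returned by loadlocal_mnist are unsigned 8-bit integers (worth confirming with train.dtype), and it uses two small made-up vectors instead of the MNIST files. The helper name four_distances is hypothetical. It runs the same four formulas once on uint8 inputs and once on float64 copies of the same values.

```python
import numpy as np
from scipy.spatial import distance


def four_distances(a, b):
    """Compute the four distance variants from the question for 1-D arrays."""
    diff = a - b  # for unsigned integer inputs this subtraction can wrap around
    d1 = np.linalg.norm(diff)
    d2 = distance.euclidean(a, b)
    d3 = ((diff ** 2).sum()) ** 0.5
    d4 = np.dot(diff, diff) ** 0.5
    return d1, d2, d3, d4


# Hypothetical stand-ins for two image rows (not real MNIST data).
a = np.array([0, 10, 200, 255], dtype=np.uint8)
b = np.array([5, 250, 100, 0], dtype=np.uint8)

# Same values, two dtypes: integer arithmetic vs. floating-point arithmetic.
as_uint8 = four_distances(a, b)
as_float = four_distances(a.astype(np.float64), b.astype(np.float64))

print("uint8  :", as_uint8)
print("float64:", as_float)
```

If the four float64 values agree with each other while the uint8 values do not, integer wraparound in train_row - test_row (and in the squaring or dot product) would be a plausible explanation for the mismatch, and casting train and test with astype(np.float64) before computing distances would be the thing to try.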