Why am I getting error in this K-Nearest Neighbour algorithm?

Question

I'm trying to code something similar to k-nearest neighbors, but way simpler for my algorithms and data structure class.

This is what is asked:

Calculate the k-nearest neighbors of a given point q in A. Note that k is a natural number >= 1, q is an integer that represents the index of the point in A that contains q, and A is an array of 2D points.

This method must return an ordered list of k neighbors, where the order is determined by the indices of the neighbors in A. If there are fewer than k neighbors in A, the list should be empty.

This is my code:

import math

def euclidean_distance(A, q):
    return math.sqrt((A[q][0] - A[0][0])**2 + (A[q][1] - A[0][1])**2)

def knn(k, q, A):
    distances = []
    for i in range(len(A)):
        distances.append((euclidean_distance(A, i), i))
    distances.sort(key=lambda x: x[0])
    return distances[:k]

However, it doesn't pass any tests and I don't know why. Here are the tests:

A = [[0.0,0.0], [1.0,1.0], [4.0,1.0], [0.0,3.0], [1.0,3.0]]
assert knn(2,0,A) == [1,3]
assert knn(2,1,A) == [0,4]
assert knn(2,2,A) == [1,4]
assert knn(2,3,A) == [1,4]
assert knn(2,4,A) == [1,3]

In `euclidean_distance`, you're comparing everything to point A[0], not to point A[q]. That's not what you want. — Frank Yellin, Aug 13 '23 at 06:14

score 1 · Accepted Answer · answered Aug 13 '23 at 08:56

There are the following issues in your code:

euclidean_distance can only compute the distance from a given point to the one at index 0. It does not help that its argument is named q, since that is not the q of the caller. To fix this, pass two points as arguments.
distances[:k] will always include the point q itself, since it is at distance 0 from q. This point should be excluded and not be counted among the k nearest.
distances holds (distance, index) tuples, which is what it was sorted by, but the distances are not expected in the result. The returned list should only contain indices.
The expected output for knn(2,3,A) is [1,4], but A[4] is closer to A[3] than A[1] is to A[3]. This suggests that the expected output needs to be sorted by index, and not by distance.
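
The last point can be checked directly. The following is only a small sketch, using the array A from the tests in the question and math.dist (available from Python 3.8); the names dist_from_q, by_distance and by_index are made up for this illustration:

```python
import math

A = [[0.0, 0.0], [1.0, 1.0], [4.0, 1.0], [0.0, 3.0], [1.0, 3.0]]
q = 3

# Distance from A[q] to every other point, keyed by index.
dist_from_q = {i: math.dist(A[i], A[q]) for i in range(len(A)) if i != q}

# The two nearest indices, first in distance order, then in index order.
by_distance = sorted(dist_from_q, key=dist_from_q.get)[:2]
by_index = sorted(by_distance)

print(by_distance)  # [4, 1]
print(by_index)     # [1, 4]
```

A[4] is at distance 1 and A[1] at roughly 2.24, so the distance order is [4, 1], while the test expects [1, 4].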

Here is your code with those corrections applied to it:

import math

# Takes two points and calculates the distance between them:
def euclidean_distance(a, b):
    return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def knn(k, q, A):
    ref = A[q]
    distances = []
    for i in range(len(A)):
        if i != q:  # exclude the point at q
            distances.append((euclidean_distance(A[i], ref), i))  # use both points
    distances.sort(key=lambda x: x[0])
    indices = [i for dist, i in distances[:k]]  # discard the distances
    return sorted(indices)  # need to return them in sorted order

There are still ways to improve this:

Don't pass a key argument to the sort method: tuples are compared element by element, so by default the sort orders by the first member (the distance) and, if there are ties, by the second member (the index).
Don't call math.sqrt: comparing the squares of the distances is enough.
Don't sort all collected points, but only extract the smallest. See Fastest method of getting k smallest numbers in unsorted list? on how heapq can help.
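
Put together, these three suggestions could look roughly like the sketch below. The name knn_heap is made up here to keep it apart from the version above, and the early return is one reading of the "fewer than k neighbors" rule in the assignment, assuming "neighbors" means every point other than A[q]:

```python
import heapq

def knn_heap(k, q, A):
    # Assumed reading of the assignment: no answer if A has fewer than k other points.
    if k > len(A) - 1:
        return []
    ref = A[q]
    # (squared distance, index) pairs for every point except A[q]; no sqrt needed.
    candidates = (
        ((p[0] - ref[0]) ** 2 + (p[1] - ref[1]) ** 2, i)
        for i, p in enumerate(A)
        if i != q
    )
    # nsmallest compares the tuples directly: squared distance first, index on ties.
    nearest = heapq.nsmallest(k, candidates)
    return sorted(i for _, i in nearest)  # result ordered by index
```

With the array A from the question this returns the same lists as the tests expect, for example knn_heap(2, 3, A) gives [1, 4].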

You may need to implement a different tie-breaker, as your question does not explain what should happen if there are ties for the k-th nearest point. There are some possible ways to deal with them, including:

Select the one with the least index -- which is what happens in this implementation;
Select all ties, making the returned list longer than k
...some other tie-breaking rule
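
As an illustration of the "select all ties" option only, here is a rough sketch. The name knn_with_ties is hypothetical, and it assumes that ties are detected by exact equality of the squared distances, which may not be what your assignment intends:

```python
def knn_with_ties(k, q, A):
    ref = A[q]
    # (squared distance, index) pairs for every point except A[q], nearest first.
    pairs = sorted(
        ((p[0] - ref[0]) ** 2 + (p[1] - ref[1]) ** 2, i)
        for i, p in enumerate(A)
        if i != q
    )
    if k > len(pairs):
        return []  # fewer than k neighbors, as in the problem statement
    cutoff = pairs[k - 1][0]  # squared distance of the k-th nearest point
    # Keep every point that is at least as close as the k-th nearest one.
    return sorted(i for d, i in pairs if d <= cutoff)

B = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(knn_with_ties(1, 0, B))  # [1, 2]
```

Here B[1] and B[2] are both at distance 1 from B[0], so asking for one neighbor returns both, whereas the least-index rule would return only [1].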

Thank you very much, I'll strive to not make the same mistakes!! — Mettid, Aug 13 '23 at 16:22
There is a ready-made [Euclidean distance](https://docs.python.org/3/library/math.html#math.dist) function in `math`. — Mechanic Pig, Aug 14 '23 at 08:13
