I'm trying to code something similar to k-nearest neighbors, but much simpler, for my algorithms and data structures class.

This is what is asked:

Calculate the k-nearest neighbors of a given point q in A. Note that k is a natural number >= 1, q is an integer that represents the index of the point in A that contains q, and A is an array of 2D points.

This method must return an ordered list of k neighbors, where the order is determined by the indices of the neighbors in A. If there are fewer than k neighbors in A, the list should be empty.

This is my code:

import math

def euclidean_distance(A, q):
    return math.sqrt((A[q][0] - A[0][0])**2 + (A[q][1] - A[0][1])**2)

def knn(k, q, A):
    distances = []
    for i in range(len(A)):
        distances.append((euclidean_distance(A, i), i))
    distances.sort(key=lambda x: x[0])
    return distances[:k]

However, it doesn't pass any tests and I don't know why. Here are the tests:

A = [[0.0,0.0], [1.0,1.0], [4.0,1.0], [0.0,3.0], [1.0,3.0]]
assert knn(2,0,A) == [1,3]
assert knn(2,1,A) == [0,4]
assert knn(2,2,A) == [1,4]
assert knn(2,3,A) == [1,4]
assert knn(2,4,A) == [1,3]
trincot
Mettid

1 Answer

There are the following issues in your code:

  • euclidean_distance can only be used to get the distance from a given point to the one at index 0. It does not help that you have named the argument q, since that is not the q of the caller. To fix this, it would be good to pass two points as arguments.

  • distances[:k] will always include the point q, since it is at distance 0 from q. This point should be excluded and not be counted in the k nearest.

  • distances holds (distance, index) tuples, so distances[:k] returns the distances too. The distances are only needed for sorting; the returned list should contain indices only.

  • The expected output for knn(2,3,A) is [1,4], but A[4] is closer to A[3] than A[1] is. This confirms that the result must be sorted by index, as the assignment says, and not by distance.
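
To make that last point concrete, here is a small check of the distances from A[3]. It is only an illustration, not part of the fix, and it assumes Python 3.8+ for math.dist; the names by_distance and nearest_two are made up for this snippet.

```python
import math

A = [[0.0, 0.0], [1.0, 1.0], [4.0, 1.0], [0.0, 3.0], [1.0, 3.0]]

# (distance, index) pairs from A[3] to every other point, nearest first.
by_distance = sorted((math.dist(A[3], A[i]), i) for i in range(len(A)) if i != 3)
nearest_two = [i for _, i in by_distance[:2]]

print(nearest_two)          # indices in order of distance
print(sorted(nearest_two))  # same indices in index order, as the test expects
```

A[4] is at distance 1 and A[1] at distance sqrt(5), so the nearest-first order is 4 then 1, while the test wants 1 then 4.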

Here is your code with those corrections applied to it:

import math

# Takes two points and calculates the distance between them:
def euclidean_distance(a, b):
    return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def knn(k, q, A):
    ref = A[q]
    distances = []
    for i in range(len(A)):
        if i != q:  # exclude the point at q
            distances.append((euclidean_distance(A[i], ref), i))  # use both points
    distances.sort(key=lambda x: x[0])
    indices = [i for dist, i in distances[:k]]  # discard the distances
    return sorted(indices)  # need to return them in sorted order
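
One requirement from the assignment is not covered by the given tests: the list should be empty when there are fewer than k neighbors. A possible way to add that check is sketched below, assuming "neighbors" means every point in A other than the one at q. The name knn_checked is mine, and math.dist (Python 3.8+) stands in for euclidean_distance.

```python
import math

def knn_checked(k, q, A):
    # Assumption: the neighbors of q are all points of A except A[q] itself.
    if k > len(A) - 1:
        return []  # fewer than k neighbors available
    ref = A[q]
    distances = [(math.dist(A[i], ref), i) for i in range(len(A)) if i != q]
    distances.sort()  # by distance, then by index on ties
    return sorted(i for _, i in distances[:k])

A = [[0.0, 0.0], [1.0, 1.0], [4.0, 1.0], [0.0, 3.0], [1.0, 3.0]]
assert knn_checked(2, 0, A) == [1, 3]
assert knn_checked(2, 3, A) == [1, 4]
assert knn_checked(5, 0, A) == []  # only 4 neighbors exist
```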

There are still ways to improve this:

  • Don't pass a key argument to the sort method: by default, tuples are compared by their first element (the distance), and if there are ties, by their second element (the index).
  • Don't call math.sqrt: comparing the squares of the distances is enough.
  • Don't sort all collected points, but only extract the k smallest. See "Fastest method of getting k smallest numbers in unsorted list?" on how heapq can help.
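
Put together, those three suggestions could look roughly like this. Treat it as a sketch rather than the definitive version: knn_heap is a name I made up, and heapq.nsmallest is one way to avoid the full sort.

```python
import heapq

def knn_heap(k, q, A):
    ref = A[q]
    # Squared distances keep the same ordering, so math.sqrt is not needed.
    candidates = (
        ((p[0] - ref[0]) ** 2 + (p[1] - ref[1]) ** 2, i)
        for i, p in enumerate(A)
        if i != q  # exclude the point at q
    )
    # nsmallest compares the (squared_distance, index) tuples directly,
    # so a tie in distance goes to the lower index.
    nearest = heapq.nsmallest(k, candidates)
    return sorted(i for _, i in nearest)  # return indices in index order

A = [[0.0, 0.0], [1.0, 1.0], [4.0, 1.0], [0.0, 3.0], [1.0, 3.0]]
assert knn_heap(2, 1, A) == [0, 4]
assert knn_heap(2, 4, A) == [1, 3]
```

For small inputs like the ones in the tests, the plain sort is just as good; the heap only pays off when k is much smaller than len(A).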

You may need to implement a different tie-breaker, as your question does not explain what should happen if several points tie for the k-th nearest. There are some possible ways to deal with ties, including:

  • Select the one with the least index -- which is what happens in this implementation;
  • Select all ties, making the returned list longer than k;
  • ...some other tie-breaking rule
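
As an illustration of the second option, here is one way "select all ties" might be written. Everything in it is an assumption on my part: the name knn_with_ties, the use of exact equality on squared distances to detect a tie, and the choice to still return an empty list when there are fewer than k neighbors.

```python
def knn_with_ties(k, q, A):
    ref = A[q]
    # (squared_distance, index) pairs for every point except the one at q.
    scored = sorted(
        ((p[0] - ref[0]) ** 2 + (p[1] - ref[1]) ** 2, i)
        for i, p in enumerate(A)
        if i != q
    )
    if k > len(scored):
        return []  # fewer than k neighbors
    cutoff = scored[k - 1][0]  # squared distance of the k-th nearest point
    # Keep every point that is at least as close as the k-th nearest one.
    return sorted(i for d, i in scored if d <= cutoff)

B = [[0, 0], [1, 0], [0, 1], [5, 5]]
assert knn_with_ties(1, 0, B) == [1, 2]  # B[1] and B[2] tie for nearest
```

With floating-point coordinates, exact equality may miss ties that differ only by rounding, so a tolerance might be needed there.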
trincot