After having troubles with MATLAB I decided to try Python:
I wrote a function that calculates kNN when the samples are of my own class using my own distance function:
def closestK(sample, otherSamples, distFunc, k):
"Returns the closest k samples to sample based on distFunc"
n = len(otherSamples)
d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]
idx = sorted(range(0,len(d)), key=lambda k: d[k])
return idx[1:(k+1)]
def kNN(samples, distFunc, k):
return [[closestK(samples[i], samples, distFunc, k)] for i in range(len(samples))]
and this is the distance function:
@staticmethod def distanceRepr(c1, c2): r1 = c1.repr r2 = c2.repr # because cdist needs 2D array if r1.ndim == 1: r1 = np.vstack([r1,r1]) if r2.ndim == 1: r2 = np.vstack([r2,r2]) return scipy.spatial.distance.cdist(r1, r2, 'euclidean').min()
But it still works amazingly slower compared to the "normal" kNN function, even when using "brute" algorithm. Am I doing something wrong?
UPDATE
I'm adding the constructor of the class. The attribute repr contains a set of vectors (from 1 to whatever) and the distance is calculated to be the minimal euclidean distance between the two sets of repr.
class myCluster:
def __init__(self, index = -1, P = np.array([])):
if index ==-1 :
self.repr = np.array([])
self.IDs = np.array([])
self.n = 0
self.center = np.array([])
else:
self.repr = np.array(P)
self.IDs = np.array(index)
self.n = 1
self.center = np.array(P)
and the rest of relevant code (X is a matrix whose rows are samples and columns are variables):
level = [myCluster(i, X[i,:]) for i in range(0,n)]
kNN(level, myCluster.distanceRepr, 3)
UPDATE 2
I've made some measurements and the line that takes most of the time is
d = [distFunc(sample, otherSamples[i]) for i in range(0,n)]
So there is something with the distFunc
. When I change it to return
np.linalg.norm(c1.repr-c2.repr)
i.e. "normal" vector calculation, with no sorting, the running time stays the same. So the problem lies in the calling of this function. Does it make sense that the use of classes changes the running time by a factor of 60?