
Can you please tell me how to properly calculate the distance from every point in my testData to the points in trainData?

For now I am getting only a single value, whereas I should get the distance from each point in the data set and be able to assign it a class. I have to use numpy for this.

Update: the problem now is that I am getting the following error and don't know how to fix it.

KeyError: 0

I am trying to obtain the accuracy of the classified labels. Any ideas, please?

import matplotlib.pyplot as plt
import random
import numpy as np
import operator
from sklearn.cross_validation import train_test_split
# In[1]
def readFile():
    f = open('iris.data', 'r')
    d = np.dtype([ ('features',np.float,(4,)),('class',np.str_,20)])
    data = np.genfromtxt(f, dtype = d ,delimiter=",")
    dataPoints = data['features']
    labels = data['class']
    return dataPoints, labels
# In[2]
def normalizeData(dataPoints):
    #normalize the data so the values will be between 0 and 1
    dataPointsNorm = (dataPoints - dataPoints.min())/(dataPoints.max() - dataPoints.min())
    return dataPointsNorm
def crossVal(dataPointsNorm):
    # spliting for train and test set for crossvalidation
    trainData, testData = train_test_split(dataPointsNorm, test_size=0.20, random_state=25)
    return trainData, testData

def calculateDistance(trainData, testData): 
    #Euclidean distance calculation on numpy arrays
    distance = np.sqrt(np.sum((trainData - testData)**2, axis=-1))
    # Argsort sorts indices from closest to furthest neighbor, in ascending order
    sortDistance = distance.argsort()
    return distance, sortDistance
# In[4]
def classifyKnn(testData, trainData, labels, k):
    # Calculating nearest neighbours and based on majority vote assigning the class
    classCount = {}
    for i in range(k):
        distance, sortedDistIndices = calculateDistance(trainData, testData[i])
        voteLabel = labels[sortedDistIndices][i]
        #print voteLabel
        classCount[voteLabel] = classCount.get(voteLabel,0)+1
        print 'Class Count: ', classCount
    # Sorting dictionary to return voted class
    sortedClassCount = sorted(classCount.iteritems(), key = operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0], classCount

def testAccuracy(testData, classCount):
    correct = 0
    for x in range(len(testData)):
         print 'HERE !!!!!!!!!!!!!!'
         if testData[x][-1] is classCount[x]:
            correct += 1
    return (correct/float(len(testData))) * 100.0
def main():    
    dataPoints, labels = readFile()
    dataPointsNorm = normalizeData(dataPoints)
    trainData, testData = crossVal(dataPointsNorm)
    result, classCount = classifyKnn(testData, trainData, labels, 5)
    print result
    accuracy = testAccuracy(testData, classCount)
    print accuracy

main()

I have the data normalized and split into train and test sets, but the distance calculation is wrong.

Thanks for any tips.
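A note on the `KeyError: 0` above: in `testAccuracy`, `classCount` is the dictionary of vote tallies built in `classifyKnn`, so its keys are label strings, and `classCount[x]` with an integer `x` would raise exactly this error. The comparison also uses `is` where `==` is probably intended, and `testData` holds only features, so `testData[x][-1]` is a feature value rather than a label. Below is a minimal sketch of the accuracy step, assuming one predicted label per test point has been collected into a list and the true labels were split alongside the features (`train_test_split` accepts several arrays at once). The names `class_count`, `accuracy_percent`, and the toy labels are hypothetical and not part of the code above.

```python
import numpy as np

# Hypothetical vote tally, keyed by label strings as in classifyKnn
class_count = {'Iris-setosa': 3, 'Iris-versicolor': 2}

# Indexing the tally with an integer reproduces the reported error
try:
    class_count[0]
except KeyError as err:
    message = 'KeyError: {}'.format(err)

def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Element-wise comparison gives booleans; their mean is the hit rate
    return np.mean(predicted == actual) * 100.0

# Toy labels, invented for illustration only
predicted = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
actual = ['Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-setosa']

result = accuracy_percent(predicted, actual)  # 75.0
```

The accuracy function needs the true test labels, so the labels have to be carried through the train/test split together with the features.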

AGS
jackal
  • `distance = np.sqrt(np.sum((trainData[0] - testData[0])**2, axis=0))` Just glancing through your code, it seems like you shouldn't need to index trainData and testData there, perhaps? – Sid Mar 24 '15 at 02:40
  • Hey Sid, nice one for that, but it still has not fixed the issue. The result I am getting now is a vector; I should be getting the entire matrix of distances and then have them sorted in ascending order, from the closest to the furthest point. – jackal Mar 24 '15 at 13:56
  • Just following up. This is working! Unfortunately, in my case it was pure hacking rather than knowledge ;(. `distance = np.sqrt(np.sum((trainData - testData)**2, axis=-1))` – jackal Mar 24 '15 at 18:43
  • If the number of features is m and the number of instances in the training set is n, then you can broadcast any m-long vector against the training set and get element-wise operations (in your case, to calculate the Euclidean distance). That's what you want to do with each (m-long) instance in the test set: loop through the test set, compute its distance with the above expression, and record it. The output will be a 1D array of distances from the given test instance to each instance in the training set. You will have to sort it after you find the distances. That's KNN for you ;). – Sid Mar 25 '15 at 16:20
  • Setting axis=0 (in a 2D matrix) will compute distances along the 0th dimension (in your case, rows or instances, which is what you don't want). Setting axis=1 will compute distances along the 1st dimension (in your case, columns or features, which is what you want). Setting axis=-1 will set the axis to the last dimension of the larger array (in your case a 2D matrix), hence for your case axis=1 and axis=-1 are the same. – Sid Mar 25 '15 at 16:20
  • Sid thank you very much for this valuable info about numpy matrices. – jackal Mar 28 '15 at 10:11
  • Possible duplicate of [How to do n-D distance and nearest neighbor calculations on numpy arrays](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays) – Daniel F Sep 18 '18 at 10:47
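
The broadcasting and `axis` points from the comments can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's final code: the function names (`distances_to_train`, `pairwise_distances`, `knn_predict`) and the toy arrays are hypothetical, and the training labels are assumed to be aligned row for row with the training features.

```python
import numpy as np
from collections import Counter

def distances_to_train(train, point):
    """Distances from one m-long point to each of the n training rows."""
    # (n, m) - (m,) broadcasts row-wise; summing over the last axis
    # (the features) leaves one distance per training instance
    return np.sqrt(np.sum((train - point) ** 2, axis=-1))

def pairwise_distances(train, test):
    """Matrix of shape (t, n): distance from every test row to every training row."""
    # (t, 1, m) - (1, n, m) broadcasts to (t, n, m)
    diff = test[:, np.newaxis, :] - train[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def knn_predict(train, train_labels, test, k):
    """Majority vote among the k nearest training rows for each test row."""
    dist = pairwise_distances(train, test)
    # argsort along axis 1 orders training indices from closest to furthest
    nearest = np.argsort(dist, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        votes = Counter(train_labels[row].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data, invented for illustration only
train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
train_labels = np.array(['a', 'a', 'b', 'b'])
test = np.array([[0.0, 0.5], [5.0, 5.5]])

predicted = knn_predict(train, train_labels, test, k=3)
```

With these toy arrays, `distances_to_train(train, test[0])` has shape `(4,)`, one distance per training row, whereas summing with `axis=0` would collapse the instances and leave a shape of `(2,)`, one value per feature. The full matrix from `pairwise_distances` has shape `(2, 4)`.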
