Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
I'm getting the following warning: RuntimeWarning: invalid value encountered in sqrt (it points at the line "return np.sqrt(m)").
I am attempting to find the average Mahalanobis distance of the given iris dataset (for both the raw and the normalized dataset). The warning only occurs for the normalized version of the dataset, which makes me wonder whether I have misunderstood what normalization means (both in code and mathematically).
I thought that normalization means that each component of a vector is divided by its vector length (causing the vector to add up to 1). I found the SO question "How to normalize a 2-dimensional numpy array in python less verbose?" and thought it matched my concept of normalization. But now my code is reporting that the Mahalanobis distance over the normalized dataset is NaN.
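To make concrete what I mean, here is a minimal sketch of the two operations I may be mixing up. The vector v is a made-up example, not taken from the iris data: dividing by the sum of the components gives a vector that adds up to 1, while dividing by the Euclidean length gives a vector of length 1.

```python
import numpy as np

# Hypothetical example vector, not taken from the iris data.
v = np.array([3.0, 4.0])

# Dividing by the sum of the components: the result adds up to 1.
sum_normalized = v / v.sum()

# Dividing by the Euclidean length (L2 norm): the result has length 1,
# but its components generally do not add up to 1.
length_normalized = v / np.linalg.norm(v)

print(sum_normalized.sum())
print(np.linalg.norm(length_normalized))
print(length_normalized.sum())
```

My normalize function further down does the first of these (division by the row sum).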
def mahalanobis(data):
    import numpy as np
    import scipy.spatial.distance

    avg = 0
    count = 0
    # Covariance of the columns (features); each row is one observation.
    covar = np.cov(data, rowvar=0)
    invcovar = np.linalg.inv(covar)
    # Sum the distance over every unordered pair of rows (i < j).
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1
    return avg / count
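For comparison, the same average over all pairs can probably be computed without the double loop by using scipy.spatial.distance.pdist. This is only a sketch: the function name mean_pairwise_mahalanobis and the random sample data are made up for illustration, and I am assuming that each row is one observation.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_mahalanobis(data):
    """Average Mahalanobis distance over all unordered pairs of rows."""
    invcovar = np.linalg.inv(np.cov(data, rowvar=False))
    # pdist returns one distance per pair (i, j) with i < j.
    return pdist(data, metric="mahalanobis", VI=invcovar).mean()

rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 4))  # made-up data, not the iris set
print(mean_pairwise_mahalanobis(sample))
```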
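def normalize(data):
    import numpy as np

    # Divide each row by the sum of its components so that every row adds up to 1.
    row_sums = data.sum(axis=1)
    norm_data = np.zeros(data.shape)  # same shape as the input
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i,:] = row / row_sum
    return norm_data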
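One check that might narrow this down: if every row of the normalized data adds up to 1, the columns are no longer independent, so the covariance matrix could be singular and np.linalg.inv would not return a meaningful inverse. The sketch below tests that idea on made-up random data (not the iris set, and with hypothetical variable names) by multiplying each covariance matrix by a vector of ones. A result that is numerically zero means the matrix has a non-trivial null space.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the iris measurements: 50 rows, 4 positive features.
data = rng.uniform(1.0, 8.0, size=(50, 4))

# Same row-sum normalization as normalize(), written with broadcasting.
norm_data = data / data.sum(axis=1, keepdims=True)

covar_raw = np.cov(data, rowvar=False)
covar_norm = np.cov(norm_data, rowvar=False)

ones = np.ones(4)
# For the normalized data the row sums are constant, so the covariance of
# every feature with the row sum is zero and covar_norm @ ones vanishes.
print(covar_norm @ ones)
# For the raw data there is no such constraint.
print(covar_raw @ ones)
```

If the same thing happens with the real normalized iris data, the inverse covariance matrix would be unreliable, which might explain a negative value under the square root.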