8

I have the following code in R that calculates the mahalanobis distance on the Iris dataset and returns a numeric vector with 150 values, one for every observation in the dataset.

x=read.csv("Iris Data.csv")
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)  

I tried to implement the same in Python using 'scipy.spatial.distance.mahalanobis(u, v, VI)' function, but it seems this function takes only one-dimensional arrays as parameters.

jose14
  • 83
  • 1
  • 6

2 Answers2

8

I used the Iris dataset from R, I suppose it is the same you are using.

First, these is my R benchmark, for comparison:

x <- read.csv("IrisData.csv")
x <- x[,c(2,3,4,5)]
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)  

Then, in python you can use:

from scipy.spatial.distance import mahalanobis
import scipy as sp
import pandas as pd

x = pd.read_csv('IrisData.csv')
x = x.ix[:,1:]

Sx = x.cov().values
Sx = sp.linalg.inv(Sx)

mean = x.mean().values

def mahalanobisR(X,meanCol,IC):
    m = []
    for i in range(X.shape[0]):
        m.append(mahalanobis(X.iloc[i,:],meanCol,IC) ** 2)
    return(m)

mR = mahalanobisR(x,mean,Sx)

I defined a function so you can use it in other sets, (observe I use pandas DataFrames as inputs)

Comparing results:

In R

> D2[c(1,2,3,4,5)]

[1] 2.134468 2.849119 2.081339 2.452382 2.462155

In Python:

In [43]: mR[0:5]
Out[45]: 
[2.1344679233248431,
 2.8491186861585733,
 2.0813386639577991,
 2.4523816316796712,
 2.4621545347140477]

Just be careful that what you get in R is the squared Mahalanobis distance.

  • This is exactly what I was looking for. I'm not really well versed with Python yet so I was struggling over it. Thanks a ton! – jose14 Apr 24 '15 at 04:46
  • Could you also suggest how to use the Mahalanobis distances to perform outlier detection? How can we decide upon the threshold value of the distances in order to detect outliers? – jose14 Apr 24 '15 at 09:14
  • You could use the $(1- \alpha)$ percentile of a $\chi^2$ distribution as a threshold. See here for a (very much!) descriptive answer about statistical aspects of Mahalanobis distance: http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance – Cristián Antuña Apr 24 '15 at 13:24
  • 1
    Note: `df.ix` is [deprecated](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.ix.html) and should be replaced with `df.iloc` – Michael Boles Aug 09 '21 at 20:52
  • Thank you @MichaelBoles I just edited the post to reflect that change. – Cristián Antuña Aug 10 '21 at 15:22
0

A simpler solution would be:

from scipy.spatial.distance import cdist

x = ...

mean = x.mean(axis=0).reshape(1, -1)  # make sure 2D
vi = np.linalg.inv(np.cov(x.T))

cdist(mean, x, 'mahalanobis', VI=vi)
sencer
  • 345
  • 5
  • 17