6

I've been trying to validate my code to calculate Mahalanobis distance written in Python (and double check to compare the result in OpenCV) My data points are of 1 dimension each (5 rows x 1 column).

In OpenCV (C++), I was successful in calculating the Mahalanobis distance when the dimension of a data point was with above dimensions.

The following code was unsuccessful in calculating Mahalanobis distance when dimension of the matrix was 5 rows x 1 column. But it works when the number of columns in the matrix are more than 1:

import numpy;
import scipy.spatial.distance;
s = numpy.array([[20],[123],[113],[103],[123]]);
covar = numpy.cov(s, rowvar=0);
invcovar = numpy.linalg.inv(covar)
print scipy.spatial.distance.mahalanobis(s[0],s[1],invcovar);

I get the following error:

Traceback (most recent call last):
  File "/home/abc/Desktop/Return.py", line 6, in <module>
    invcovar = numpy.linalg.inv(covar)
  File "/usr/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 355, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
garak
  • 4,713
  • 9
  • 39
  • 56
  • If I understand correctly, the `numpy.linalg.inv` method requires a [square matrix](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html#numpy.linalg.inv) but you seem to have given it a single element array. – srgerg Mar 30 '12 at 00:17
  • I understand that. So, is there work around? – garak Mar 30 '12 at 00:28
  • One dimensional Mahalanobis distance is called z-score. This link might help: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/ – mrtpk Jun 26 '19 at 07:56

2 Answers2

5

One-dimensional Mahalanobis distance is really easy to calculate manually:

import numpy as np
s = np.array([[20], [123], [113], [103], [123]])
std = s.std()
print np.abs(s[0] - s[1]) / std

(reducing the formula to the one-dimensional case).

But the problem with scipy.spatial.distance is that for some reason np.cov returns a scalar, i.e. a zero-dimensional array, when given a set of 1d variables. You want to pass in a 2d array:

>>> covar = np.cov(s, rowvar=0)

>>> covar.shape
()

>>> invcovar = np.linalg.inv(covar.reshape((1,1)))

>>> invcovar.shape
(1, 1)

>>> mahalanobis(s[0], s[1], invcovar)
2.3674720531046645
Danica
  • 28,423
  • 6
  • 90
  • 122
  • This was helpful. However, I don't seem to grasp it. The above manual technique that you described gets me a different answer compared to the one below using `scipy.spatial.distance.mahalanobis`. I was able to match this (2nd technique) result with the one in OpenCV. Am I missing something here? – garak Mar 30 '12 at 01:16
  • @raul_w That's because I was being stupid. :) (You need to divide by the standard deviation, not the variance -- I forgot to take the sqrt of that too.) – Danica Mar 30 '12 at 02:49
  • well, it still doesn't match :p . using standard deviation gives a result of 2.64... and using built-in function after reshaping the matrix, gives 2.36 (which I assume is right as it matches results in OpenCV). Thanks for the help. – garak Mar 30 '12 at 12:46
  • 1
    The reason for the difference is that `np.cov` by default gives the unbiased estimator for variance (dividing by N-1), while `np.std` defaults to the biased estimator (dividing by N). You can reconcile the two with the `ddof` argument. The definition of "right" here depends on exactly what it is you want to do, but the `np.cov` way is more standard (as evidenced by the fact that it's what OpenCV does :p). – Danica Mar 30 '12 at 17:44
  • One dimensional Mahalanobis distance is called z-score. This link might help: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/ – mrtpk Jun 26 '19 at 07:56
0

Covariance needs 2 arrays to compare. In both np.cov() and Opencv CalcCovarMatrix, it expects the two arrays to be stacked on top of each other (Use vstack). You can also have the 2 arrays to be side by side if you change the Rowvar to false in numpy or use COVAR_COL in opencv. If your arrays are multidimentional, just flatten() them first.

So if I want to compare two 24x24 images, I flatten them both into 2 1x1024 images, then stack the two to get a 2x1024, and that is the first argument of np.cov().

You should then get a large square matrix, where it shows the results of comparing each element in array1 with each element in array2. In my example it will be 1024x1024. THAT is what you pass into your invert function.

john k
  • 6,268
  • 4
  • 55
  • 59