Background
The Mahalanobis distance measures how far a point lies from the mean, expressed in standard deviations, see Wikipedia. It works in coordinates rotated onto the eigenvectors of the covariance matrix and is therefore closely related to principal component analysis (PCA). Cross Validated contains several excellent explanations, e.g. this "bottom-to-top explanation", as well as a function (cholMaha, see below) for computing a pairwise distance matrix.
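As a minimal sketch of the definition, the distance of each point from the mean can be computed with base R's mahalanobis() and checked against the formula sqrt((x - mu)' S^-1 (x - mu)). The matrix pts and its values are made up purely for illustration; they are not part of the example used later.

```r
# Illustrative data only (hypothetical values, not from the example below)
pts <- cbind(
  x = c(2.1, 3.4, 1.9, 4.8, 3.3, 2.7),
  y = c(1.0, 2.2, 0.7, 3.1, 2.4, 1.5)
)
mu <- colMeans(pts)  # mean vector
S  <- cov(pts)       # sample covariance matrix

# stats::mahalanobis() returns the squared distance, hence sqrt()
d_builtin <- sqrt(mahalanobis(pts, center = mu, cov = S))

# Same quantity written out from the definition
centered <- sweep(pts, 2, mu)
d_manual <- sqrt(rowSums((centered %*% solve(S)) * centered))

# Both routes should agree up to floating-point rounding
all.equal(unname(d_builtin), unname(d_manual))
```

Note that this is the distance of each point from the mean; the sections below deal with pairwise distances between points.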
Relationship of Mahalanobis distance to PCA
Let's assume a small data example:
A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
Then we can compute the Mahalanobis distance matrix with D2.dist from package biotools or with the function mentioned above:
## Mahalanobis distance from package biotools
library("biotools")
# sqrt, because D2.dist returns squared version
sqrt(D2.dist(A, cov(A)))
## https://stats.stackexchange.com/questions/65705/pairwise-mahalanobis-distances
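cholMaha <- function(X) {
  dec <- chol(cov(X))                   # upper triangular R with cov(X) = R'R
  tmp <- forwardsolve(t(dec), t(X))     # whiten the data: solve R' tmp = X'
  dist(t(tmp))                          # Euclidean distance of whitened data
}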
cholMaha(A)
Now to the main point. The Mahalanobis distance can also be computed as the Euclidean distance between the rescaled scores (the rotated data, not the loadings) of a principal component analysis:
## derive Mahalanobis distance from principal components
pc <- prcomp(A) # principal components
AA <- scale(pc$x) # "upscale" all components to the same level
# Euclidean distance of rescaled PC transformed variables is identical to
# Mahalanobis distance
dist(AA)
The result is identical to the two approaches above.
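A quick numerical check of this claim, using base R only so that it does not depend on biotools being installed. It repeats the example data and the cholMaha function from above; the names d_chol and d_pca are introduced here just for the comparison.

```r
# Example data and helper repeated so the snippet runs on its own
A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
cholMaha <- function(X) {
  dec <- chol(cov(X))
  tmp <- forwardsolve(t(dec), t(X))
  dist(t(tmp))
}

d_chol <- cholMaha(A)               # Cholesky-based Mahalanobis distances
d_pca  <- dist(scale(prcomp(A)$x))  # Euclidean distances of rescaled PC scores

# Compare the 15 pairwise distances, ignoring dist attributes
all.equal(as.vector(d_chol), as.vector(d_pca))
```

The agreement holds because cov(), prcomp() and scale() all use the same n - 1 denominator, so rescaling each component to unit variance amounts to multiplying by the inverse square root of the covariance matrix in rotated coordinates.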
Application to a classification algorithm
This relation can be used in any distance-based classification or clustering algorithm. Just transform the data matrix by the PCA rotation, rescale the components, and feed their Euclidean distances in.
## Now apply this in any classification method, e.g. hclust
par(mfrow=c(1, 2))
# Euclidean distance of original variables
plot(hclust(dist(scale(A))), main="Euclidean")
# Euclidean distance of scaled principal components
# is equivalent to Mahalanobis distance; considers covariance
plot(hclust(dist(AA)), main="Mahalanobis")
In effect, small influence factors hidden in the variables are upscaled, but unfortunately so are random errors. To understand this in detail, read the "My grandma cooks" answer at Cross Validated.
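The upscaling can be made visible by looking at the standard deviations of the principal components before and after rescaling. This is only a sketch on the example data from above; the name stretch is introduced here for illustration.

```r
# Example data repeated so the snippet runs on its own
A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
pc <- prcomp(A)
AA <- scale(pc$x)

pc$sdev           # standard deviations of PC1..PC3, in decreasing order
apply(AA, 2, sd)  # after rescaling every component has standard deviation 1

# Factor by which each component is stretched relative to PC1
stretch <- pc$sdev[1] / pc$sdev
stretch
```

The last component receives the largest stretch factor, so whatever it carries, be it a subtle real effect or measurement noise, gets the same weight in the distance as the dominant component.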