How to use Mahalanobis distance to find the K Nearest Neighbor in R

Question

I have a time series dataset from 1970 to 2020 as my training dataset, and a single observation for 2021. What I need to do is use the Mahalanobis distance to identify the 10 nearest neighbors of the 2021 observation in the training dataset. I tried several functions such as get.knn() and get.knnx(), but I could not set the distance to the Mahalanobis distance. Is there any function that I can use? Thank you in advance!

------------------edit--------------------

So I tried the mahalanobis() function and got a list of values. Are these values the Mahalanobis distances? Can I sort them to get the top 10?

Try `biotools` package. It also has some mahalanobis functions — Nad Pat, Nov 14 '21 at 04:58

score 0 · Answer 1 · answered Nov 20 '21 at 13:45

Background

The Mahalanobis distance measures how far a point is from the mean, expressed in standard deviations; see Wikipedia. It works in coordinates rotated onto the eigenvectors of the covariance matrix and is therefore related to principal component analysis (PCA). Cross Validated contains several excellent explanations, e.g. this "bottom-to-top explanation", and a function (cholMaha, see below) that computes a matrix of pairwise distances.
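For reference, the standard definition can be written as follows. The symbols are my own notation, not taken from the linked posts: $\bar{x}$ is the sample mean and $S$ the sample covariance matrix of the data.

```latex
D_M(x) = \sqrt{(x - \bar{x})^{\top} S^{-1} (x - \bar{x})},
\qquad
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} S^{-1} (x_i - x_j)}
```

The first form is the distance of a point from the mean; the second is the pairwise form between two observations. If $S$ is the identity matrix, both reduce to the ordinary Euclidean distance.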

Relationship of Mahalanobis distance to PCA

Let's assume a small data example:

A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)

Then we can compute the Mahalanobis distance matrix via D2.dist from package biotools or with the above-mentioned function:

## Mahalanobis distance from package biotools
library("biotools")
# sqrt, because D2.dist returns squared version
sqrt(D2.dist(A, cov(A)))


## https://stats.stackexchange.com/questions/65705/pairwise-mahalanobis-distances

cholMaha <- function(X) {
  dec <- chol( cov(X) )
  tmp <- forwardsolve(t(dec), t(X) )
  dist(t(tmp))
}

cholMaha(A)

Now comes the point. We can also compute the Mahalanobis distance as the Euclidean distance of the rescaled scores (the rotated data) of a principal component analysis:

## derive Mahalanobis distance from principal components
pc <- prcomp(A)     # principal components
AA <- scale(pc$x)   # "upscale" all components to the same level

# Euclidean distance of rescaled PC transformed variables is identical to
# Mahalanobis distance
dist(AA)

The result is identical to the two approaches above.
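If biotools is not installed, the equivalence can be checked numerically with base R only. The following is a minimal sketch: it repeats the example data, and the object names (S, d_chol, d_pca, d_row1) are just illustrative. It also compares one column of the distance matrix with stats::mahalanobis(), which returns squared distances from a single center.

```r
A <- data.frame(
  x = c(-2.48, -4.03, 1.15, 0.94, 5.33, 4.72),
  y = c(-3.92, -3.4, 0.92, 0.78, 3.44, 0.5),
  z = c(-1.11, -2.18, 0.21, 0.34, 1.74, 1.12)
)
S <- cov(A)

# Approach 1: Cholesky whitening (same idea as cholMaha)
dec    <- chol(S)
d_chol <- dist(t(forwardsolve(t(dec), t(as.matrix(A)))))

# Approach 2: Euclidean distance of PCA scores rescaled to unit variance
d_pca <- dist(scale(prcomp(A)$x))

# Approach 3: mahalanobis() with row 1 as center; sqrt() because it
# returns squared distances
d_row1 <- sqrt(mahalanobis(A, center = unlist(A[1, ]), cov = S))

# Both comparisons should yield TRUE (up to numerical tolerance)
all.equal(as.vector(d_chol), as.vector(d_pca))
all.equal(as.numeric(d_row1[-1]), as.numeric(as.matrix(d_chol)[-1, 1]))
```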

Application to a classification algorithm

One can now use this relation in any distance-based classification or clustering algorithm. Just transform the data matrix by the PCA rotation, rescale the components, and feed the resulting Euclidean distances in.

## Now apply this in any classification method, e.g. hclust
par(mfrow=c(1, 2))

# Euclidean distance of original variables
plot(hclust(dist(scale(A))), main="Euclidean")

# Euclidean distance of scaled principal components
# is equivalent to Mahalanobis distance; considers covariance
plot(hclust(dist(AA)), main="Mahalanobis")

In effect, small influence factors hidden in the variables are upscaled, but unfortunately so are random errors. To understand this in detail, read the "My grandma cooks" answer at Cross Validated.
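Coming back to the original question, the 10 nearest neighbors of a single new observation can probably be found directly with mahalanobis() and order(). This is only a sketch under assumptions: the data frame train, the vector new_obs, the column names x, y, z and the simulated values are all hypothetical stand-ins for the real data, and the covariance matrix is assumed to be estimated from the training data alone.

```r
set.seed(1)

# Hypothetical stand-in for the training data: one row per year, 1970-2020
train <- data.frame(
  year = 1970:2020,
  x = rnorm(51), y = rnorm(51), z = rnorm(51)
)
# Hypothetical 2021 observation with the same variables
new_obs <- c(x = 0.3, y = -0.1, z = 0.8)

vars <- c("x", "y", "z")
S <- cov(train[, vars])   # assumption: covariance estimated from training data

# mahalanobis() returns SQUARED distances of every row from `center`;
# using the new observation as center gives its distance to each year
d <- sqrt(mahalanobis(train[, vars], center = new_obs[vars], cov = S))

k  <- 10
nn <- order(d)[seq_len(k)]   # indices of the k smallest distances
neighbours <- data.frame(year = train$year[nn], distance = unname(d[nn]))
neighbours
```

So yes, the values returned by mahalanobis() can be sorted; because they are squared distances, the ranking is the same with or without sqrt(). Alternatively, one could whiten both the training data and the new observation with the same transformation (as in cholMaha above) and pass them to get.knnx() from the FNN package, since Euclidean distance on whitened data equals the Mahalanobis distance.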
