Mahalanobis difference by group with dplyr

Question

I want to get a Mahalanobis difference for each set of two scores, after being grouped by another variable. In this case, it would be a Mahalanobis difference for each Attribute (across each set of 2 scores). The output should be 3 Mahalanobis distances (one for A, B and C).

Currently I am working with (in my original dataframe, there are some NAs, hence I include one in the reprex):

library(tidyverse)
library(purrr)



df <- tibble(Attribute = unlist(map(LETTERS[1:3], rep, 5)),
             Score1 = c(runif(7), NA, runif(7)),
             Score2 = runif(15))

mah_db <- df %>% 
  dplyr::group_by(Attribute) %>% 
  dplyr::summarise(MAH = mahalanobis(Score1:Score2, 
                                     center = base::colMeans(Score1:Score2), 
                                     cov(Score1:Score2, use = "pairwise.complete.obs")))

This raises the error:

Caused by error in base::colMeans(): ! 'x' must be an array of at least two dimensions

But as far as I can tell, I am giving colMeans two columns.

So what's going wrong here? And I wonder if even fixing this gives a complete solution?

You get your answers here: https://stackoverflow.com/questions/53660849/r-use-group-by-and-mutate-in-dplyr-to-apply-function-that-returns-a-vector. — Maël, May 13 '22 at 08:37
Does this answer your question? [R - use group\_by() and mutate() in dplyr to apply function that returns a vector the length of groups](https://stackoverflow.com/questions/53660849/r-use-group-by-and-mutate-in-dplyr-to-apply-function-that-returns-a-vector) — Maël, May 13 '22 at 08:37
Hm, close! But this produces a Mahalanobis distance per set of scores rather than per group. I can probably adjust this, but it's not a complete solution. — eartoolbox, May 13 '22 at 08:46
AFAIK, the `mahalanobis` function will return a vector for each pair of observation, so do you want a mean of the distance for each attribute or am I missing something? — Maël, May 13 '22 at 08:56
I want a mahalanobis distance per group (here = 3). Obviously I could take the mean, but I'm not sure that's the same thing? — eartoolbox, May 13 '22 at 08:58

score 0 · Answer 1 · answered May 13 '22 at 11:22

It seems your question is more about the statistics than dplyr. So I just give a small example based on your data and an adapted example from ?mahalanobis. Perhaps also have a look here or here.

df <- subset(x = df0, Attribute == "A", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
plot(density(df$mahalanobis, bw = 0.3), ylim = c(0, 0.8),
     main="Squared Mahalanobis distances"); 
grid()
rug(df$mahalanobis)

df <- subset(x = df0, Attribute == "B", select = c("Score1", "Score2"))
df <- df[complete.cases(df), ]
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "red",
     main="Squared Mahalanobis distances"); 
rug(df$mahalanobis, col = "red")

df <- subset(x = df0, Attribute == "C", select = c("Score1", "Score2"))
df$mahalanobis <- mahalanobis(x = df, center = colMeans(df), cov = cov(df))
df$p <- pchisq(q = df$mahalanobis, df = 2, lower.tail = FALSE)
lines(density(df$mahalanobis, bw = 0.3), col = "green",
     main="Squared Mahalanobis distances"); 
rug(df$mahalanobis, col = "green")

Hope, that helps (and too long for a comment).
(Of course you can make to code much shorter, but it shows in each step what happens.)

Mahalanobis difference by group with dplyr

1 Answers1