I am trying to determine a variety of stats (median, variance, mean, min, max) from a grouped data frame while always excluding the focal observation. Some other questions, notably here, here, and here, have touched upon this problem but the provided solutions do not seem to work for me.
Consider the following reproducible example
library(data.table)
set.seed(1)
dd <- data.table(cbind(id = rep(seq(1,5,by = 1),20)),rank = rep(seq(1,4,by=1),25),value = runif(100))
Simply calculating the mean and variance and other stats per group can be achieved as follows
setkey(dd, id, rank)
a <- dd[, list(Med = median(value, na.rm=T), Aver = mean(value, na.rm=T), Var = var(value, na.rm=T), min = min(value, na.rm=T)), by = key(dd)
I am looking for some code that does something very similar but also excludes the values in the focal row. So for row 1, it needs to calculate the median, mean, variance, and minimum of value
for all observations that have id = 1
and rank = 1
, excluding the first value (given set.seed(1)
, this value should be 0.26550866.
Following the suggestion of @Khashaa , I need to clarify that the end result should be a data.table of the same length as the original data.table. Hence, for the id = 1
, rank = 1
combinations, I need different results for rows 1, 21, 41, 61, and 81 (in dd, before setkey
is executed) because every time a different value
is being omitted.
After the setkey
command, the first 5 rows all have id = 1
and rank = 1
, so what I am looking for is the various stats for a specific combination of rank and id, that exclude the focal row: i.e.
Concretely, for rank = 1
, and id = 1
Row 1 : mean = .7757, median = .8669, var = .05415 (excluding row 1 value)
Row 3 : mean = .6369, median = .6738, var = .1146 (excluding row 3 value)
Row 5 : mean = .7335, median = .8669, var = .09977 (excluding row 5 value)
I calculated this simply as follows in R
> df <- data.frame(dd)
> mean(df[2:5,3])
[1] 0.7757967
> median(df[2:5,3])
[1] 0.8669111
> var(df[2:5,3])
[1] 0.05415191
> mean(df[c(1,2,4,5),3])
[1] 0.6369373
> median(df[c(1,2,4,5),3])
[1] 0.6737677
> var(df[c(1,2,4,5),3])
[1] 0.1145611
> mean(df[1:4,3])
[1] 0.733509
> median(df[1:4,3])
[1] 0.8669111
> var(df[1:4,3])
[1] 0.099774
Hope this clarifies !
In my data.table some of these stats will be missing which ideally would result in NA
values