0

I am trying to determine a variety of stats (median, variance, mean, min, max) from a grouped data frame while always excluding the focal observation. Some other questions, notably here, here, and here, have touched upon this problem but the provided solutions do not seem to work for me.

Consider the following reproducible example

library(data.table)
set.seed(1)
dd <- data.table(cbind(id = rep(seq(1,5,by = 1),20)),rank = rep(seq(1,4,by=1),25),value = runif(100))

Simply calculating the mean and variance and other stats per group can be achieved as follows

setkey(dd, id, rank)
a <- dd[, list(Med = median(value, na.rm=T), Aver = mean(value, na.rm=T), Var = var(value, na.rm=T), min = min(value, na.rm=T)), by = key(dd)

I am looking for some code that does something very similar but also excludes the values in the focal row. So for row 1, it needs to calculate the median, mean, variance, and minimum of value for all observations that have id = 1 and rank = 1, excluding the first value (given set.seed(1), this value should be 0.26550866.

Following the suggestion of @Khashaa , I need to clarify that the end result should be a data.table of the same length as the original data.table. Hence, for the id = 1, rank = 1 combinations, I need different results for rows 1, 21, 41, 61, and 81 (in dd, before setkey is executed) because every time a different value is being omitted.

After the setkey command, the first 5 rows all have id = 1 and rank = 1, so what I am looking for is the various stats for a specific combination of rank and id, that exclude the focal row: i.e. Concretely, for rank = 1, and id = 1

Row 1 : mean = .7757, median = .8669, var = .05415 (excluding row 1 value)
Row 3 : mean = .6369, median = .6738, var = .1146 (excluding row 3 value)
Row 5 : mean = .7335, median = .8669, var = .09977 (excluding row 5 value)

I calculated this simply as follows in R

> df <- data.frame(dd)
> mean(df[2:5,3])
[1] 0.7757967
> median(df[2:5,3])
[1] 0.8669111
> var(df[2:5,3])
[1] 0.05415191
> mean(df[c(1,2,4,5),3])
[1] 0.6369373
> median(df[c(1,2,4,5),3])
[1] 0.6737677
> var(df[c(1,2,4,5),3])
[1] 0.1145611
> mean(df[1:4,3])
[1] 0.733509
> median(df[1:4,3])
[1] 0.8669111
> var(df[1:4,3])
[1] 0.099774

Hope this clarifies !

In my data.table some of these stats will be missing which ideally would result in NA values

Community
  • 1
  • 1
SJDS
  • 1,239
  • 1
  • 16
  • 31
  • `dd[, list(Med = median(value[-1], na.rm=T),...), by = key(dd)]`? – Khashaa Mar 27 '15 at 03:39
  • Thanks, this seems to be a very simple solution and I think it does almost exactly what I need. Problem is that it creates a shorter data.frame so I'm not sure what exactly is being omitted here. What I need is a data.table of the same number of rows so that you e.g. get different values for id = 1 and rank = 1 depending on whether you are in row 1 or row 21, 41, 61, or 81. This wasn't very clear before. I update the question... – SJDS Mar 27 '15 at 04:05
  • Not very clear. Could you just add the expected outcome? – Khashaa Mar 27 '15 at 04:34
  • I can try but it's would take me hours of manual work. I think the confusion comes from the fact that I base the row numbers on dd, before executing the setkey command. After the `setkey` command, the first five rows have `id = 1` and `rank = 1` ok? So in row 1, I want to median, mean, variance ... for rows 2:5, in row 2, I want the same for rows 1,3,4,5... in row 5, I want it for rows 1:4 – SJDS Mar 27 '15 at 04:52
  • @simon_icl - you are grouping rows, it is bound to get lesser number of rows then original as your key value is not unique. – vrajs5 Mar 27 '15 at 04:55
  • @vrajs5, I understand your point. Maybe the approach via data.table is not correct then. I am indeed grouping rows but I need to get the same number of rows because the result (see part of desired outcome I just added is unique. – SJDS Mar 27 '15 at 05:02
  • For `mean` there is a simple workaround that calculates the normal mean and counts the number of occurrences, by doing something simple like `dd$dum <- 1` and then `tmp <- dd[, sum(dum), by=key(dd)` and then using this count measure (after matching it back to dd) per key value to create an adjusted mean value that excludes the value in the focal row ~ `(sum(value), by = key(dd) - value in focal row) / (count - 1)` – SJDS Mar 27 '15 at 05:11
  • You can add the expected result just for mean. Seems like leave-one-out calculation. – Khashaa Mar 27 '15 at 05:15
  • You're right @Khashaa, I calculated three examples in the updated version, hope this clarifies what I am looking for... – SJDS Mar 27 '15 at 05:21

1 Answers1

1

Does this achieve your desired outcome?

loo <- function(x, f) unlist(lapply(1:length(x), function(i)f(x[-i])))
dd[, list(Med = loo(value, median), Aver =loo(value, mean), Var = loo(value, var), min=loo(value, min)), by = key(dd)]
#    id rank       Med      Aver         Var        min
#1:  1    1 0.8669111 0.7757967 0.054151907 0.43465948
#2:  1    1 0.6278029 0.6084976 0.095213904 0.26550866
#3:  1    1 0.6737677 0.6369373 0.114561137 0.26550866
#4:  1    1 0.6278029 0.6139549 0.099762613 0.26550866
#5:  1    1 0.8669111 0.7335090 0.099774004 0.26550866
...
Khashaa
  • 7,293
  • 2
  • 21
  • 37
  • Hi, this seems to give me the result I want indeed. Thanks a lot. It's not very fast (I have over 150,000 rows) but that's not a problem. Is there anyway of avoiding the generation of warnings ? Every time there are missing values it generates a warning for both `min` and `max` and creates the value `Inf`, which ideally should be just `NA` ... – SJDS Mar 27 '15 at 05:47
  • add `na.rm=T` option to `loo` function – Khashaa Mar 27 '15 at 05:48
  • any suggestion how to make this code work not for a single column 'value' but for a matrix of columns to which the same functions need be applied ? – SJDS Nov 22 '16 at 05:11