Taking group means, excluding the observation itself (and dealing with NA's)

Question

I am calculating group means with data.table as follows:

library(data.table)

DF <- fread(
    "A   B   D  value  iso   year   
     0   1   1  NA ECU   2009   
     1   0   2  1  ECU   2009   
     1   0   1  2  ECU   2009
     0   0   3  1  BRA   2011   
     1   0   4  0  BRA   2011 
     0   0   3  1  BRA   2011     
     0   1   7  NA ECU   2008   
     1   0   1  1  ECU   2008   
     1   0   1  1  ECU   2008   
     0   0   3  2  BRA   2012   
     0   0   3  2  BRA   2012   
     1   0   4  NA BRA   2012",
  header = TRUE
)

setDT(DF)[,mean_value := mean(value, na.rm=TRUE),  by=c("iso", "year")]

In order to exclude the current observation from the calculation, I tried incorporating (sum(value) - value)/(n()-1) from this answer, and do:

setDT(DF)[,mean_value := (sum(value, na.rm=TRUE)-value)/(.N-1),  by=c("iso", "year")]

But I am worried about .N-1

setDT(DF)[,n_val:= (.N-1),  by=c("iso", "year")]

It always gives a value of 2, where it should give a value of one when there is only one observation not NA. I tried to go with:

setDT(DF)[,mean_value := (sum(value, na.rm=TRUE)-value)/(colSums(!is.na(value))-1),  by=c("iso", "year")]

But that gives:

Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

and

setDT(DF)[,mean_value := (sum(value, na.rm=TRUE)-value)/(.N[!is.na(value)]-1),  by=c("iso", "year")]

But that leaves too little observations. What is the missing piece of the puzzle here?

talat · Accepted Answer · 2021-06-03T10:24:56.437

2

IIUC this should do what you're looking for:

DF[, mean_value := (sum(value, na.rm=TRUE)-value)/(sum(!is.na(value))-!is.na(value)),  
     by=c("iso", "year")]

    A B D value iso year mean_value
 1: 0 1 1    NA ECU 2009         NA
 2: 1 0 2     1 ECU 2009        2.0
 3: 1 0 1     2 ECU 2009        1.0
 4: 0 0 3     1 BRA 2011        0.5
 5: 1 0 4     0 BRA 2011        1.0
 6: 0 0 3     1 BRA 2011        0.5
 7: 0 1 7    NA ECU 2008         NA
 8: 1 0 1     1 ECU 2008        1.0
 9: 1 0 1     1 ECU 2008        1.0
10: 0 0 3     2 BRA 2012        2.0
11: 0 0 3     2 BRA 2012        2.0
12: 1 0 4    NA BRA 2012         NA

Note: you may want to additionally consider edge cases like a group of size 1 with NA value which would lead to division by zero

edited Jun 03 '21 at 10:24

answered Jun 03 '21 at 10:15

talat

68,970
21
126
157

Thank you I added na.rm=TRUE to my code. but it is still not the same output as yours. Could you please help? Thanks. – TarJae Jun 03 '21 at 10:19
1

@TarJae, look at what `(.N-1)` does in your code: it doesnt care about NAs. so in row 3 you're dividing by .N-1=2 but there is only 1 non-NA value you sum up (when excluding the current row) – talat Jun 03 '21 at 10:22
Thank you @talat. I have checked the group sizes beforehand, but reading your comment, it would be nice to be able to include a minimum group size (like 10) somewhere.. But I assume that the condition (before the comma) does not respect the `by` of the same command right? I will just have to create a new variable that is `NA` when the group size is under 10, and then apply this code to that variable. – Tom Jun 03 '21 at 10:32
1

@Tom right, the condition before the comma doesnt repect the grouping. But you can always add a condition inside j. For example, `DF[, mean_value := if(.N > 10) (sum(value, na.rm=TRUE)-value)/(sum(!is.na(value))-!is.na(value)) else NA, by=c("iso", "year")]` – talat Jun 03 '21 at 11:57
That is great! Thank you very much! – Tom Jun 03 '21 at 13:21

Taking group means, excluding the observation itself (and dealing with NA's)

1 Answers1