1

I'm trying to summarize (take the mean) of various columns based on a single column within a data.table.

Here's a toy example of my data and the code I used that shows the problem I'm having:

library(data.table)
a<- data.table(
  a=c(1213.1,NA,113.41,133.4,121.1,45.34),
  b=c(14.131,NA,1.122,113.11,45.123,344.3),
  c=c(101.2,NA,232.1,194.3,12.12,7645.3),
  d=c(11.32,NA,32.121,94.3213,1223.1,34.1),
  e=c(1311.32,NA,12.781,13.2,2.1,623.2),
  f=c("A", "B", "B", "A", "B", "X"))
a
setkey(a,f) # column "f" is what I want to summarize columns by

a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)] # I just want to summarize first 4 columns

The output of the last line:

> a[, lapply(.SD, mean), by=f, .SDcols=c(1:4)]
 f a b c d
1: A 673.25 63.6205 147.75 52.82065
2: B NA NA NA NA
3: X 45.34 344.3000 7645.30 34.10000

Why are B entries NA? Shouldn't NA be ignored in the calculation of the mean? I think I found a similar issue here, but perhaps this is different and/or I've got the syntax messed up.

If this isn't possible in data.table, I'm open to other suggestions.

smci
  • 32,567
  • 20
  • 113
  • 146
Prophet60091
  • 589
  • 9
  • 23

1 Answers1

2

In R, the default behavior of the mean() function is to output NA if there are missing values. To ignore NAs in the mean calculation, you need to set the argument na.rm=TRUE. lapply takes in additional arguments to the function it is passed, so for your problem, you can try

a[, lapply(.SD, mean, na.rm=TRUE), by=f, .SDcols=c(1:4)]

ialm
  • 8,510
  • 4
  • 36
  • 48
  • Sorry, but why is this not a `data.table` question (I rolled back your tag edit to the Q)? :-O – Arun Nov 01 '14 at 11:21
  • Because it is mere coincidence that this behaviour was experienced within a data.table; even without `library(data.table)` the mean of a column containing `NA` is still `NA`. – Hugh Nov 01 '14 at 12:31
  • @Hugh, it helps (future) users having the same question identify this answer quickly with the data.table tag. The question is titled "data.table gives ...". It'd be hard for people who don't really understand what's going on to search the entire R tag. – Arun Nov 02 '14 at 12:51