
Sometimes I need to count the number of non-NA elements in one or another column in my data.table. What is the best data.table-tailored way to do so?

For concreteness, let's work with this:

DT <- data.table(id = sample(100, size = 1e6, replace = TRUE),
                 var = sample(c(1, 0, NA), size = 1e6, replace = TRUE), key = "id")

The first thing that comes to my mind works like this:

DT[!is.na(var), N := .N, by = id]

But this has the unfortunate shortcoming that N does not get assigned to rows where var is missing, i.e. DT[is.na(var), N] is NA.
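To make the shortcoming concrete, here is a tiny toy table (hypothetical data, not the DT above) showing that rows filtered out by `!is.na(var)` never receive a count:

```r
library(data.table)

toy <- data.table(id = c(1, 1, 2), var = c(5, NA, NA))
toy[!is.na(var), N := .N, by = id]  # only row 1 is in the subset

toy$N  # 1 NA NA -- rows with missing var are left with N = NA
```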

So I work around this by appending:

DT[!is.na(var), N := .N, by = id][ , N := max(N, na.rm = TRUE), by = id] # OPTION 1

However, I'm not sure this is the best approach; another option I thought of and one suggested by the analog to this question for data.frames would be:

DT[ , N := length(var[!is.na(var)]), by = id] # OPTION 2

and

DT[ , N := sum(!is.na(var)), by = id] # OPTION 3
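Option 3 works because `!is.na(var)` returns a logical vector and `sum()` coerces TRUE to 1, so summing it counts the non-NA elements directly. A quick standalone sanity check:

```r
x <- c(1, 0, NA, 1, NA)
sum(!is.na(x))  # 3 -- counts the non-missing entries
```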

Comparing computation time of these (average over 100 trials), the last seems to be the fastest:

OPTION 1 | OPTION 2 | OPTION 3
  .075   |   .065   |   .043

Does anyone know a speedier way for data.table?

MichaelChirico

1 Answer


Yes, option 3 seems to be the best one. I've added another, which is valid only if you're willing to change the key of your data.table from id to var, but option 3 is still the fastest on your data.

library(microbenchmark)
library(data.table)

dt <- data.table(id = (1:100)[sample(10, size = 1e6, replace = TRUE)],
                 var = c(1, 0, NA)[sample(3, size = 1e6, replace = TRUE)],
                 key = "var")

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)

microbenchmark(times=10L,
               dt1[!is.na(var),.N,by=id][,max(N,na.rm=T),by=id],
               dt2[,length(var[!is.na(var)]),by=id],
               dt3[,sum(!is.na(var)),by=id],
               dt4[.(c(1,0)),.N,id,nomatch=0L])
# Unit: milliseconds
#                                                         expr      min       lq      mean    median        uq       max neval
#  dt1[!is.na(var), .N, by = id][, max(N, na.rm = T), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403    10
#                     dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365  88.54663  86.93693  89.56223 100.57788    10
#                             dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774  50.65637  49.60966  51.77160  61.92701    10
#                        dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087  89.09415  89.47084  96.22914 100.55434    10
jangorecki
  • @MichaelChirico, have a look at the [Keys and binary search based subsets](https://github.com/Rdatatable/data.table/wiki/Getting-started) (and let us know if you find something unclear / to be added [here](https://github.com/Rdatatable/data.table/issues/944)). – Arun Apr 16 '15 at 23:04
  • There is another option: `dt5[,.N[!is.na(var)],by=id]` - third fastest on my system and quite a lot better than options 1 and 2. For some reason option 3 produces binary responses (0,1) instead of counts if more than one 'by' variable is used. The suggested option 5 gives correct counts. – Amy M Dec 11 '17 at 18:38