Sometimes I need to count the number of non-NA elements in one or another column of my data.table. What is the best data.table-tailored way to do so?
For concreteness, let's work with this:
library(data.table)
DT <- data.table(id = sample(100, size = 1e6, replace = TRUE),
                 var = sample(c(1, 0, NA), size = 1e6, replace = TRUE), key = "id")
The first thing that comes to my mind works like this:
DT[!is.na(var), N := .N, by = id]
But this has the unfortunate shortcoming that N does not get assigned to any row where var is missing, i.e. DT[is.na(var), N] = NA.
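To make the shortcoming concrete, here is a minimal sketch on a tiny table. The table `toy` and its values are made up for illustration and are not part of the data above.

```r
library(data.table)

# Tiny illustrative table (hypothetical data, not the DT from the question)
toy <- data.table(id = c(1, 1, 1, 2, 2), var = c(1, NA, 0, NA, 1))

# Assign the per-id count only on rows where var is not NA
toy[!is.na(var), N := .N, by = id]

# Rows where var is NA were never touched by the assignment,
# so N stays NA there (rows 2 and 4)
print(toy)
```

The rows excluded by the `i` filter are simply skipped by `:=`, which is why a second pass is needed to fill them in.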
So I work around this by appending:
DT[!is.na(var), N := .N, by = id][ , N := max(N, na.rm = TRUE), by = id] # OPTION 1
However, I'm not sure this is the best approach. Another option I thought of, and one suggested by the analogous question for data.frames, would be:
DT[ , N := length(var[!is.na(var)]), by = id] # OPTION 2
and
DT[ , N := sum(!is.na(var)), by = id] # OPTION 3
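As a sanity check, the three options can be compared on a small deterministic table. This is only a sketch: the table `chk` and the column names `N1`, `N2`, `N3` are hypothetical, and the data is built so that every id has at least one non-NA value (otherwise `max(..., na.rm = TRUE)` in option 1 would have nothing to take the maximum of).

```r
library(data.table)

# Small deterministic table (hypothetical): 5 ids with 10 rows each,
# var cycles through 1, 0, NA so every id has some non-NA values
chk <- data.table(id  = rep(1:5, each = 10),
                  var = rep(c(1, 0, NA), length.out = 50))

# Option 1: count on the non-NA rows, then spread the count to the NA rows
chk[!is.na(var), N1 := .N, by = id][ , N1 := max(N1, na.rm = TRUE), by = id]

# Option 2: length of the non-NA subset
chk[ , N2 := length(var[!is.na(var)]), by = id]

# Option 3: sum of the logical non-NA indicator
chk[ , N3 := sum(!is.na(var)), by = id]

# TRUE if all three columns agree on every row
chk[ , all(N1 == N2 & N2 == N3)]
```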
Comparing computation time of these (average over 100 trials), the last seems to be the fastest:
OPTION 1 | OPTION 2 | OPTION 3
   0.075 |    0.065 |    0.043
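For anyone wanting to reproduce a comparison like this, here is one possible sketch using base R's `system.time`. It is not necessarily how the numbers above were obtained: the helper `time_option`, the wrappers `opt1` to `opt3`, the seed, the smaller table size, and the number of trials are all assumptions chosen to keep the example quick.

```r
library(data.table)

set.seed(42)  # arbitrary seed (assumption), only for reproducibility
n <- 1e5      # smaller than the 1e6 rows above so the sketch runs quickly
bench <- data.table(id  = sample(100, size = n, replace = TRUE),
                    var = sample(c(1, 0, NA), size = n, replace = TRUE),
                    key = "id")

# The three options wrapped as functions of a data.table d (hypothetical names)
opt1 <- function(d) d[!is.na(var), N := .N, by = id][ , N := max(N, na.rm = TRUE), by = id]
opt2 <- function(d) d[ , N := length(var[!is.na(var)]), by = id]
opt3 <- function(d) d[ , N := sum(!is.na(var)), by = id]

# Average elapsed seconds over `trials` runs. Each run works on a fresh copy
# because := modifies in place; the copy cost is included for every option.
time_option <- function(f, trials = 10) {
  mean(replicate(trials, system.time(f(copy(bench)))[["elapsed"]]))
}

timings <- c(opt1 = time_option(opt1),
             opt2 = time_option(opt2),
             opt3 = time_option(opt3))
print(timings)
```

Absolute timings will vary by machine and data.table version, so only the relative ordering is meaningful.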
Does anyone know a speedier way for data.table?