For speed, with a large number of varcols, perhaps look to iterating by column. Something like this (untested):
keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]
The issue with is.na is that it creates a new logical vector to hold its result, which R then has to loop through to find the TRUEs so it knows which elements of keep to set FALSE. However, in the above for loop, R can reuse the (identically sized) temporary memory for the result of is.na, since it is marked unused and available for reuse after each iteration completes. IIUC.
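For example, a self-contained toy run (the data here is made up purely to illustrate; var1/var2 stand in for the real varcols):

library(data.table)
x = data.table(var1 = c(1, NA, 3), var2 = c(4, NA, NA), z = 1:3)
varcols = c("var1", "var2")
keep = rep(TRUE, nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE   # flag any row with an NA in a varcol
x[keep]   # only the first row is left: it has no NA in var1 or var2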
1. is.na(x[, ..varcols])
This is ok but creates a large copy to hold the logical matrix, which is nrow(x) * length(varcols) entries. And the == 0 on the result of rowSums will need a new vector, too. (1 and 2 are written out in full in the sketch after this list.)
2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again, and so will &. Each result of is.na has to be held by R separately until the whole expression completes. Probably makes no difference until length(varcols) increases a lot, or nrow(x) is very large.
3. CJ(c(0,1),c(0,1))
Best so far, but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
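For reference, here is how I read 1 and 2 written out as full calls, plus a quick look at why the CJ scaling question arises (these are my reconstructions from the descriptions above, not tested code):

x[rowSums(is.na(x[, ..varcols])) == 0]    # 1: logical matrix, rowSums vector, then the == 0 vector
x[!is.na(var1) & !is.na(var2)]            # 2: one temporary per !, plus one more for the &

CJ(c(0,1), c(0,1))            # 3: a keyed 4-row table of all 0/1 combinations
CJ(c(0,1), c(0,1), c(0,1))    # 8 rows; with one 0/1 NA indicator per varcol it doubles each time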
So, the very fastest (I guess) would be a C version like this (pseudo-code):
keep = rep(TRUE,nrow(x))
for (j=0; j<length(varcols); j++)
    for (i=0; i<nrow(x); i++)
        if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]
That would need one single allocation for keep (in C or R), and then the C loop would run through the columns, updating keep whenever it saw an NA. The C could be done in Rcpp (e.g. from RStudio), with the inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NAs in some rows, by saving even fetching the later column values after the first NA in each row.
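A minimal Rcpp sketch of that idea, purely as an illustration (keep_complete is a name made up here, and it assumes the varcols are all numeric; other column types would need their own NA test):

library(Rcpp)
cppFunction('
LogicalVector keep_complete(List x, IntegerVector colidx) {
    NumericVector first = x[colidx[0] - 1];
    int n = first.size();
    LogicalVector keep(n);
    for (int i = 0; i < n; i++) keep[i] = true;
    for (int j = 0; j < colidx.size(); j++) {
        NumericVector v = x[colidx[j] - 1];          // 1-based R index -> 0-based
        for (int i = 0; i < n; i++) {
            // the keep[i] && short-circuit skips the ISNA test once a row is already dropped
            if (keep[i] && ISNA(v[i])) keep[i] = false;
        }
    }
    return keep;
}')
x[keep_complete(x, match(varcols, names(x)))]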