
I'm working with a large dataset x. I want to drop the rows of x that have a missing value in one or more of a set of columns, that set being specified by a character vector varcols.

So far I've tried the following:

require(data.table)
x <- CJ(var1=c(1,0,NA),var2=c(1,0,NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1","var2")

x[, missing := apply(sapply(.SD,is.na),1,any),.SDcols=varcols]
x <- x[!missing]

Is there a faster way of doing this? Thanks.

thelatemail
Michael

4 Answers


This should be faster than using apply:

x[rowSums(is.na(x[, ..varcols])) == 0, ]
#    var1 var2 textcol
# 1:    0    0       e
# 2:    0    1       f
# 3:    1    0       h
# 4:    1    1       i
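A closely related option, not part of the original answer and not benchmarked here, is base R's complete.cases, which also does the row-wise NA check in C and reads naturally with data.table subsetting:

```r
library(data.table)

# Rebuild the question's example data
x <- CJ(var1 = c(1, 0, NA), var2 = c(1, 0, NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1", "var2")

# complete.cases() is TRUE for rows with no NA in the chosen columns
res <- x[complete.cases(x[, ..varcols])]
res$textcol
# [1] "e" "f" "h" "i"
```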
MichaelChirico
flodel

Here is a revised version of a C++ solution with a number of modifications based on a long discussion with Matthew (see comments below). I am new to C++, so I am sure that someone might still be able to improve this.

After library("RcppArmadillo") you should be able to run the whole file, including the benchmark, using sourceCpp('cleanmat.cpp'). The C++ file includes two functions. cleanmat takes two arguments (X and the index of the columns) and returns the matrix, restricted to those columns, with the rows containing missing values removed. keep just takes one argument X and returns a logical vector.

Note about passing data.table objects: These functions do not accept a data.table as an argument. The functions have to be modified to take DataFrame as an argument (see here).

cleanmat.cpp

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;


// [[Rcpp::export]]
mat cleanmat(mat X, uvec idx) {
    // keep only the selected columns (idx holds 1-based column indices)
    X = X.cols(idx - 1);
    // get dimensions
    int n = X.n_rows, k = X.n_cols;
    // create keep vector
    vec keep = ones<vec>(n);
    for (int j = 0; j < k; j++) 
        for (int i = 0; i < n; i++) 
            if (keep[i] && !is_finite(X(i,j))) keep[i] = 0;
    // alternative with view for each row (slightly slower)
    /*vec keep = zeros<vec>(n);
    for (int i = 0; i < n; i++) {
         keep(i) = is_finite(X.row(i));
    }*/  
    return (X.rows(find(keep==1)));
}


// [[Rcpp::export]]
LogicalVector keep(NumericMatrix X) {
    int n = X.nrow(), k = X.ncol();
    // create keep vector
    LogicalVector keep(n, true);
    for (int j = 0; j < k; j++) 
        for (int i = 0; i < n; i++) 
            if (keep[i] && NumericVector::is_na(X(i,j))) keep[i] = false;

    return (keep);
}


/*** R
require("Rcpp")
require("RcppArmadillo")
require("data.table")
require("microbenchmark")

# create matrix
X = matrix(rnorm(1e+07),ncol=100)
X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]=NA
colnames(X)=paste("c",1:ncol(X),sep="")

idx=sample(ncol(X),90)
microbenchmark(
  X[!apply(X[,idx],1,function(X) any(is.na(X))),idx],
  X[rowSums(is.na(X[,idx])) == 0, idx],
  cleanmat(X,idx),
  X[keep(X[,idx]),idx],
times=3)

# output
# Unit: milliseconds
#                                                     expr       min        lq    median        uq       max
# 1                                       cleanmat(X, idx)  253.2596  259.7738  266.2880  272.0900  277.8921
# 2 X[!apply(X[, idx], 1, function(X) any(is.na(X))), idx] 1729.5200 1805.3255 1881.1309 1913.7580 1946.3851
# 3                                 X[keep(X[, idx]), idx]  360.8254  361.5165  362.2077  371.2061  380.2045
# 4                  X[rowSums(is.na(X[, idx])) == 0, idx]  358.4772  367.5698  376.6625  379.6093  382.5561

*/
user2503795
  • There's quite a bit to improve here. Search for some of my comments and answers about benchmarking for example. The main things are i) 500 replications could mean you're timing call overhead (3 runs on large data is usually a better way to scale) and ii) you haven't included the timing results as a table (absolute times are often instructive). – Matt Dowle Dec 08 '12 at 16:03
  • And is it intended to be a follow up on my answer? You didn't actually say so. It's different enough I wasn't sure. – Matt Dowle Dec 08 '12 at 16:08
  • Here are the timing results, and yes it was a follow up, which I guess got lost in a revision. Now it's stated clearly. I will look at some of your older comments and see how I can improve this. I am just getting into `rcpp` and see this as an exercise, so any suggestions for improvements are welcome. I will also look into a solution for passing in a `data.table`, but I don't think that I can return one. – user2503795 Dec 08 '12 at 16:44
  • Thanks great exercise. Next step would be increase the size of the data, a lot. – Matt Dowle Dec 08 '12 at 16:58
  • Instead of doing the x[keep] bit on the Rcpp side, you could return keep and do x[keep] on the R side. The main part to do in C is the nested loop that updates keep. – Matt Dowle Dec 08 '12 at 17:07
  • Or you could return a DataFrame from Rcpp and setattr the class to be a data.table. There's an example in S.O. somewhere (not related to Rcpp but about instant methods to change type). But returning keep should be easier for now. – Matt Dowle Dec 08 '12 at 17:12
  • I tried returning the keep part and doing the column selection on the R side so that you call `X[cleanmat2(X),varcols]`. It's the fastest solution for the smaller matrix but actually slower with the larger one. – user2503795 Dec 08 '12 at 17:41
  • Where's the nested for loop on the C side? Is the test on just one column in varcols? – Matt Dowle Dec 08 '12 at 18:14
  • And check optimization of Rcpp compile is on (-O3) and no debug (no -g). – Matt Dowle Dec 08 '12 at 18:17
  • There is actually an error because I have to call `X2[cleanmat2(X2[,varcols]),varcols]` to get the correct result with `cleanmat2`, but the timing results are very similar. About the nesting: `X.row(i)` returns a vector and `is_finite` checks whether any element in that vector is missing. Or am I misunderstanding your question? `varcols` contains two columns but can contain an arbitrary number of columns. Not sure how to turn debug off but I will look into it. – user2503795 Dec 08 '12 at 18:20
  • With more columns in `varcols`, `cleanmat` seems to be slightly faster than `cleanmat2` and the `rowSums` solution... – user2503795 Dec 08 '12 at 18:27
  • Then X.rows(i) is likely the problem. I guess that creates a new vector for each row, a huge no no, if so. It needs to be a nested loop like I showed in pseudo code. Loop through each column contiguously to minimise L2 cache misses, never loop by row. – Matt Dowle Dec 08 '12 at 18:49
  • `X.rows(i)` actually creates a view, which I thought is efficient because no data is copied to a new object. Anyway, I tried the nested loop and it is actually about 1.6 times as fast as the `X.rows(i)` solution in `cleanmat2`, but only slightly faster than `cleanmat`, I guess because of the ways in which the columns are selected in the two solutions (`cleanmat2` does that twice). – user2503795 Dec 08 '12 at 19:51
  • I'm a bit lost with all the revisions. You've tried the nested loop but haven't shown it, so I can't fully trust what you're saying. What precisely is a view? If it is a vector in any shape or form then it's a new vector. Or is Rcpp optimizing it somehow. The fastest solution will need no new memory other than `keep`, and that's one reason it should be faster. If you're only seeing a small improvement, then look at the data size. `X2` is just 7MB. That probably fits entirely in L2 cache, depending on your hardware. Needs to be 10 or 100 times bigger, say 700MB object.size. – Matt Dowle Dec 09 '12 at 10:48
  • Btw, the answer still says you can't pass a data.table to Rcpp, but you can, and are. There is no downside in using a data.table as a DataFrame on the Rcpp side, afaik. "Can't" is a strong word! – Matt Dowle Dec 09 '12 at 10:50
  • Also varcols needs to be much longer, say 100 or 1000. Not 2. Any differences between methods when varcols is just 2, may be statistically significant, but not of any practical significance (just too small to care about). You're showing differences of 10 microseconds in some of these tests! That could be the time to map column names to numbers, once, iiuc. – Matt Dowle Dec 09 '12 at 10:55
  • If you could maybe start the answer from scratch again, would be great. One large example, and one benchmark result table, where the absolute times take seconds or minutes. Also with a random sprinkling of NAs would be a high quality benchmark, as well as the no NA case and fully NA case, perhaps. – Matt Dowle Dec 09 '12 at 11:03
  • Yes, I will revise my answer. Here is some [info](http://www.drdobbs.com/cpp/c-view-objects/196513737) about views: "When data needs to be presented in different ways, the natural solution is to create multiple copies of the data with each copy arranged in a specific manner. This is not always the most efficient solution and can lead to overcomplicated or duplicated code further down the line. This is the problem that view objects solve; they provide an efficient and concise means of presenting data in multiple ways without inefficiency." – user2503795 Dec 09 '12 at 15:58
  • Thanks. So a view is a vector of integers or pointers then. We don't need that here. Do you see why creating a view in this case is less efficient than not creating a view? Their "without inefficiency" is in the context of the whole sentence about presenting data in multiple ways [at the same time]. – Matt Dowle Dec 09 '12 at 17:18
  • I just revised the answer with a new benchmark based on a larger matrix and random NAs. It's still not 700MB though. My c++ compiler is broken for the 64 bit version of R so that I have problems allocating much larger vectors. – user2503795 Dec 09 '12 at 18:03
  • Great, it's getting there. The `X[,idx]` being passed to `keep` is taking a copy of those columns, though (`idx` looks to be 90 random columns of the 100, which is a good test). Should be faster if the call was just `keep(x,idx)` and then inside `keep` loop `j` through the elements of `idx` passed in. – Matt Dowle Dec 09 '12 at 21:11
  • And instead of `X[sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE)]` I guess you probably meant `X[cbind(sample(nrow(X),1000,replace = TRUE),sample(ncol(X),1000,replace = TRUE))]`? The first assigns to the union of the columns and the rows. The 2nd (subsetting a matrix by a 2 column matrix) is the way to assign to particular cells. – Matt Dowle Dec 09 '12 at 21:15
  • At 70MB, `X` is still quite small. 10 times bigger, 700MB should be ok on 32bit. I only have 2GB RAM and I just ran `X = matrix(rnorm(1e8),ncol=100)` ok. You shouldn't need 64bit for 700MB, iiuc, assuming you have 2GB RAM. – Matt Dowle Dec 09 '12 at 21:32
  • Note also that `rowSums` works for matrices (all columns have the same type) and it converts `data.frame` to `matrix` coercing all columns to the same type (copying in doing so). Since the question gave a `data.table` as example data, I took it that mixed type columns was a feature of the input (precluding matrix). `rowSums` might fall away when the C code is converted from `NumericMatrix` to `DataFrame`. – Matt Dowle Dec 09 '12 at 21:50

For speed, with a large number of varcols, perhaps look to iterate by column. Something like this (untested) :

keep = rep(TRUE,nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
x[keep]

The issue with is.na is that it creates a new logical vector to hold its result, which R must then loop through to find the TRUEs so it knows which elements of keep to set to FALSE. However, in the above for loop, R can reuse the (identically sized) temporary memory for the result of is.na, since it is marked unused and available for reuse once each iteration completes. IIUC.
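Made runnable against the question's example data (just adding the setup; the loop itself is unchanged):

```r
library(data.table)
x <- CJ(var1 = c(1, 0, NA), var2 = c(1, 0, NA))
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1", "var2")

# One is.na() scan per column, flipping keep to FALSE where NAs are found
keep = rep(TRUE, nrow(x))
for (j in varcols) keep[is.na(x[[j]])] = FALSE
res <- x[keep]
res$textcol
# [1] "e" "f" "h" "i"
```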

1. is.na(x[, ..varcols])
This is ok but creates a large intermediate copy: a logical matrix with as many columns as length(varcols). And the ==0 on the result of rowSums will need a new vector, too.

2. !is.na(var1) & !is.na(var2)
Ok too, but ! will create a new vector again and so will &. Each of the results of is.na has to be held by R separately until the expression completes. Probably makes no difference until length(varcols) increases a lot, or ncol(x) is very large.

3. CJ(c(0,1),c(0,1))
Best so far but not sure how this would scale as length(varcols) increases. CJ needs to allocate new memory, and it loops through to populate that memory with all the combinations, before the join can start.
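One way to keep that approach workable as length(varcols) grows is to build the CJ call programmatically. A sketch (my addition, assuming every column in varcols takes only the values 0 and 1, as in the question's data):

```r
library(data.table)
x <- CJ(var1 = c(1, 0, NA), var2 = c(1, 0, NA))  # CJ output is keyed by var1,var2
x[, textcol := letters[1:nrow(x)]]
varcols <- c("var1", "var2")

# One c(0,1) argument per column in varcols, then an inner join on x's key
cj_args <- rep(list(c(0, 1)), length(varcols))
res <- x[do.call(CJ, cj_args), nomatch = 0]
res$textcol
# [1] "e" "f" "h" "i"
```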

So, the very fastest (I guess), would be a C version like this (pseudo-code) :

keep = rep(TRUE,nrow(x))
for (j=0; j<varcols; j++)
   for (i=0; i<nrow(x); i++)
       if (keep[i] && ISNA(x[i,j])) keep[i] = FALSE;
x[keep]

That would need one single allocation for keep (in C or R) and then the C loop would loop through the columns updating keep whenever it saw an NA. The C could be done in Rcpp, in RStudio, inline package, or old school. It's important the two loops are that way round, for cache efficiency. The thinking is that the keep[i] && part helps speed when there are a lot of NA in some rows, to save even fetching the later column values at all after the first NA in each row.

MichaelChirico
Matt Dowle

Two more approaches

two vector scans

x[!is.na(var1) & !is.na(var2)]

join with unique combinations of non-NA values

If you know the possible unique values in advance, this will be the fastest

x[CJ(c(0,1),c(0,1)), nomatch=0]
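Note the join only works because x is keyed on var1 and var2 (CJ output is keyed, and the timing setup sets key = c('var1','var2')). A minimal runnable sketch on the question's example data:

```r
library(data.table)
x <- CJ(var1 = c(1, 0, NA), var2 = c(1, 0, NA))  # result is keyed by var1,var2
x[, textcol := letters[1:nrow(x)]]

# Inner join against the non-NA combinations; nomatch=0 drops unmatched rows
res <- x[CJ(c(0, 1), c(0, 1)), nomatch = 0]
res$textcol
# [1] "e" "f" "h" "i"
```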

Some timings

x <-data.table(var1 = sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
                var2= sample(c(1,0,NA), 1e6, T, prob = c(0.45,0.45,0.1)),
                key = c('var1','var2'))

system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
   user  system elapsed 
   0.09    0.02    0.11 

system.time(x[!is.na(var1) & !is.na(var2)])
   user  system elapsed 
   0.06    0.02    0.07 

system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
   user  system elapsed 
   0.03    0.00    0.04 
MichaelChirico
mnel