
I have a big data.frame; 100,000 observations of 700 variables.

Most of the variables actually have the value 0 in every observation, and I would like to remove those variables/columns.

I tried the following,

data <- data[apply(data, 2, function(x){all(x == 0)})]

But the apply call took a long time to finish.

I tried a while loop, in case the problem was working with all the data at once.

i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i+1
  }
}

But I kept having the same problem; it took too long.

So,

  • Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.

and, above all

  • Is there any way to do this faster?
Masclins
  • My guess is R is copying your data back and forth, see this post http://stackoverflow.com/questions/16943939/elegantly-assigning-multiple-columns-in-data-table-with-lapply for a possible solution with `data.table`. – m-dz Mar 30 '17 at 10:40
  • If `data` only has numerical entries you could try something like `data[,-which(colSums(abs(data)) < 1e-16)]`. – ikop Mar 30 '17 at 10:55
  • Please give a [mcve] in your question! – jogo May 16 '17 at 08:47

2 Answers


Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any with automatic coercion of values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.

DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user  system elapsed 
#0.10    0.02    0.11 

Safer option:

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user  system elapsed 
#0.34    0.11    0.45 
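Note that the `x > 1e-16` test above would silently keep only columns with positive entries; if negative values can occur, the same approach can compare absolute values instead. A minimal sketch with hypothetical data (`DF3` is not from the original answer):

```r
DF3 <- data.frame(x = c(-1, 0, -2), y = 0, z = 1:3)  # x is non-zero but negative

# Compare absolute values so negative entries also count as non-zero;
# NULL elements in the replacement list drop the corresponding columns.
DF3[] <- lapply(DF3, function(x) if (any(abs(x) > 1e-16)) x else NULL)
names(DF3)
```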
Roland

Using a vectorized operation like `colSums` speeds up the operation on my machine -

> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user  system elapsed 
1.386   0.821   2.225 
> system.time(df2 <- df[,which(colSums(df)==0)])
user  system elapsed 
0.243   0.082   0.326 
> identical(df1, df2)
[1] TRUE
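As noted in the comment below, `colSums(df) == 0` can misidentify columns when positive and negative values cancel out. A variant that stays vectorized but counts non-zero entries instead of summing them (hypothetical data, not from the original answer):

```r
df <- data.frame(a = c(-1, 0, 1), b = 0, c = 0:2)  # b is the only all-zero column

# df != 0 is a logical matrix; colSums counts the TRUE entries per column,
# so a count of 0 identifies an all-zero column even with negative values.
df_clean <- df[, colSums(df != 0) > 0, drop = FALSE]
names(df_clean)
```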
Nishanth
  • This ends up being useful for me, even though summing might not be a solution when there are negative values. – Masclins Mar 30 '17 at 12:53