
I have a big data.frame; 100,000 observations of 700 variables.

Most of the variables actually have the value 0 in every observation, and I would like to remove those variables/columns.

I tried the following,

data <- data[apply(data, 2, function(x){all(x == 0)})]

But the apply call took a long time to finish.

I tried a while loop, in case the problem was working with all the data at once.

i <- 1
while (i <= ncol(data)) {
  if (all(data[i] == 0)) {
    data[i] <- NULL
  } else {
    i <- i+1
  }
}

But I kept having the same problem; it took too long.

So,

  • Why does that operation take THAT long? Even though the data.frame is big, the operation is pretty simple.

and, above all

  • Is there any way to do this faster?
Masclins
  • My guess is R is copying your data back and forth, see this post http://stackoverflow.com/questions/16943939/elegantly-assigning-multiple-columns-in-data-table-with-lapply for a possible solution with `data.table`. – m-dz Mar 30 '17 at 10:40
  • If `data` only has numerical entries you could try something like `data[,-which(colSums(abs(data)) < 1e-16)]`. – ikop Mar 30 '17 at 10:55
  • Please give a [mcve] in your question! – jogo May 16 '17 at 08:47

2 Answers


Your question is confusing. I assume you want to remove variables, i.e., columns. You can use any with automatic coercion of values to type logical. The usual warnings regarding comparison of floating point numbers apply. If you want to play it safe, you'll need to test whether the doubles are smaller than some precision value, which will be slower, but getting it right is often more important.

DF <- data.frame(x = 1:3, y = 1:3/10, z = 0)
DF[] <- lapply(DF, function(x) if (any(x)) x else NULL)
#Warning messages:
#1: In any(x) : coercing argument of type 'double' to logical
#2: In any(x) : coercing argument of type 'double' to logical
DF
#  x   y
#1 1 0.1
#2 2 0.2
#3 3 0.3

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x)) x else NULL))
#user  system elapsed 
#0.10    0.02    0.11 

Safer option:

set.seed(42)
DF2 <- as.data.frame(matrix(sample(0:1, 700*1e5, TRUE, prob = c(0.999999, 0.000001)), ncol = 700))

system.time(DF2[] <- lapply(DF2, function(x) if (any(x > 1e-16)) x else NULL))
#user  system elapsed 
#0.34    0.11    0.45 
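Note that the `x > 1e-16` test above would silently keep only columns with positive entries; if negative values can occur, the same approach can compare absolute values instead. A minimal sketch with hypothetical data (`DF3` is not from the original answer):

```r
DF3 <- data.frame(x = c(-1, 0, -2), y = 0, z = 1:3)  # x is non-zero but negative

# Compare absolute values so negative entries also count as non-zero;
# NULL elements in the replacement list drop the corresponding columns.
DF3[] <- lapply(DF3, function(x) if (any(abs(x) > 1e-16)) x else NULL)
names(DF3)
```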
Roland

Using a vectorized operation like `colSums` speeds up the operation on my machine -

> set.seed(123)
> df = data.frame(matrix(sample(0:1,100000*700,replace = T,prob = c(0.9999999,0.0000001)), ncol = 700))
> system.time(df1 <- df[apply(df, 2, function(x){all(x == 0)})])
user  system elapsed 
1.386   0.821   2.225 
> system.time(df2 <- df[,which(colSums(df)==0)])
user  system elapsed 
0.243   0.082   0.326 
> identical(df1, df2)
[1] TRUE
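As noted in the comment below, `colSums(df) == 0` can misidentify columns when positive and negative values cancel out. A variant that stays vectorized but counts non-zero entries instead of summing them (hypothetical data, not from the original answer):

```r
df <- data.frame(a = c(-1, 0, 1), b = 0, c = 0:2)  # b is the only all-zero column

# df != 0 is a logical matrix; colSums counts the TRUE entries per column,
# so a count of 0 identifies an all-zero column even with negative values.
df_clean <- df[, colSums(df != 0) > 0, drop = FALSE]
names(df_clean)
```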
Nishanth
  • This ends up being useful for me, even though summing might not be a solution when there are negative values. – Masclins Mar 30 '17 at 12:53