1

Supposing a data set with several rows and columns with some columns being 0 (I mean all values in the column are 0's). How one can filter out those columns? I have tried with the following code but no avail.

training_data <- Filer(function(x) { !(all(x[, 1:99]==0))}, training_data)

UPDATE:

Sorry. In the data set, NOT all columns are numeric, so I need to specify a range from 1:99 for columns.

UPDATE V2: Added a part of my data set (using dput)

structure(list(label = structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("A", 
"B"), class = "factor"), f1 = c(15, 24, 10, 9, 6, 9), f2 = c(6, 
14, 5, 4, 2, 4), f3 = c(6, 7, 2, 2, 1, 2), f4 = c(0, 0, 0, 0, 
0, 0), f5 = c(9, 15, 6, 5, 3, 5), f6 = c(3, 7, 2, 2, 1, 2), f7 = c(1, 
0, 0, 0, 0, 0), f8 = c(4, 11, 5, 4, 2, 4), f9 = c(5, 3, 0, 0, 
0, 0), f10 = c(1, 3, 0, 0, 0, 0), f11 = c(1, 4, 2, 2, 1, 2), 
    f12 = c(0, 0, 0, 0, 0, 0), f13 = c(13, 15, 7, 6, 3, 6), f14 = c(0, 
    7, 1, 1, 1, 1), f15 = c(0, 0, 0, 0, 0, 0), f16 = c(20, 30, 
    11, 10, 6, 10), f17 = c(5, 0, 0, 0, 0, 0), f18 = c(0, 0, 
    0, 0, 0, 0), ft19 = c(28, 344, 399, 28, 82, 42), f20 = c(2.15, 
    15.64, 49.88, 4, 20.5, 6), f21 = c(0, 0, 0, 0, 0, 0), f22 = c(0, 
    0, 0, 0, 0, 0), f23 = c(6, 7, 2, 2, 1, 2), f24 = c(0, 0, 
    0, 0, 0, 0), f25 = c(19, 334, 395, 23, 79, 37), f26 = c(0, 
    26, 37, 6, 16, 7), f27 = c(11, 64, 101, 5, 17, 12), f28 = c(0, 
    0, 0, 0, 0, 0), f29 = c(2, 37, 101, 7, 26, 8), f30 = c(0, 
    18, 32, 2, 16, 4), f31 = c(0, 0, 0, 0, 0, 0), f32 = c(0, 
    0, 0, 0, 0, 0), f33 = c(3, 0, 1, 0, 1, 0), f34 = c(5, 44, 
    32, 4, 15, 5), f35 = c(0, 0, 0, 0, 0, 0), f36 = c(0, 0, 0, 
    0, 0, 0), f37 = c(0, 0, 0, 0, 0, 0), f38 = c(0, 0, 0, 0, 
    0, 0), f39 = c(6, 8, 10, 3, 2, 3), f40 = c(4, 6, 16, 4, 4, 
    3), f41 = c(18, 36, 37, 7, 5, 7), f42 = c(0, 18, 27, 0, 14, 
    1), f43 = c(0, 0, 0, 0, 0, 0), f44 = c(54, 743, 910, 65, 
    184, 100), f45 = c(14, 133, 91, 25, 18, 40), f46 = c(0, 0, 
    0, 0, 0, 0), f47 = c(4, 25, 17, 6, 6, 8), f48 = c(0, 0, 0, 
    0, 0, 0), f49 = c(0.46, 1, 1.5, 1.14, 1.5, 1.14), f50 = c(2.67, 
    1.86, 1.83, 1.88, 1.67, 1.88), f51 = c(3, 9, 1, 2, 1, 2), 
    f52 = c(0, 1, 2, 1, 1, 1), f53 = c(10, 12, 5, 4, 2, 4), f54 = c(0, 
    0, 0, 0, 0, 0), ft55 = c(3, 10, 3, 3, 2, 3), f56 = c(0.54, 
    0.07, 0.03, 0.32, 0.07, 0.21), f57 = c(0.21, 0.04, 0.01, 
    0.14, 0.02, 0.1), f58 = c(0.21, 0.02, 0.01, 0.07, 0.01, 0.05
    ), f59 = c(0, 0, 0, 0, 0, 0), f60 = c(0.32, 0.04, 0.02, 0.18, 
    0.04, 0.12), f61 = c(0.11, 0.02, 0.01, 0.07, 0.01, 0.05), 
    f62 = c(0.04, 0, 0, 0, 0, 0), f63 = c(0.14, 0.03, 0.01, 0.14, 
    0.02, 0.1), f64 = c(0.18, 0.01, 0, 0, 0, 0), f65 = c(0.04, 
    0.01, 0, 0, 0, 0), f66 = c(0.04, 0.01, 0.01, 0.07, 0.01, 
    0.05), f67 = c(0, 0, 0, 0, 0, 0), f68 = c(0.46, 0.04, 0.02, 
    0.21, 0.04, 0.14), f69 = c(0, 0.02, 0, 0.04, 0.01, 0.02), 
    f70 = c(0, 0, 0, 0, 0, 0), f71 = c(0.71, 0.09, 0.03, 0.36, 
    0.07, 0.24), f72 = c(0.18, 0, 0, 0, 0, 0), f73 = c(0, 0, 
    0, 0, 0, 0), f74 = c(1, 1, 1, 1, 1, 1), f75 = c(0.08, 0.05, 
    0.12, 0.14, 0.25, 0.14), f76 = c(0, 0, 0, 0, 0, 0), f77 = c(0, 
    0, 0, 0, 0, 0), f78 = c(0.21, 0.02, 0.01, 0.07, 0.01, 0.05
    ), f79 = c(0, 0, 0, 0, 0, 0), f80 = c(0.68, 0.97, 0.99, 0.82, 
    0.96, 0.88), f81 = c(0, 0.08, 0.09, 0.21, 0.2, 0.17), f82 = c(0.39, 
    0.19, 0.25, 0.18, 0.21, 0.29), f83 = c(0, 0, 0, 0, 0, 0), 
    f84 = c(0.07, 0.11, 0.25, 0.25, 0.32, 0.19), f85 = c(0, 0.05, 
    0.08, 0.07, 0.2, 0.1), f86 = c(0, 0, 0, 0, 0, 0), f87 = c(0, 
    0, 0, 0, 0, 0), f88 = c(0.11, 0, 0, 0, 0.01, 0), f89 = c(0.18, 
    0.13, 0.08, 0.14, 0.18, 0.12), f90 = c(0, 0, 0, 0, 0, 0), 
    f91 = c(0, 0, 0, 0, 0, 0), f92 = c(0, 0, 0, 0, 0, 0), f93 = c(0, 
    0, 0, 0, 0, 0), f94 = c(0.21, 0.02, 0.03, 0.11, 0.02, 0.07
    ), f95 = c(0.14, 0.02, 0.04, 0.14, 0.05, 0.07), f96 = c(0.64, 
    0.1, 0.09, 0.25, 0.06, 0.17), f97 = c(0, 0.05, 0.07, 0, 0.17, 
    0.02), f98 = c(0, 0, 0, 0, 0, 0), f99 = c(1.93, 2.16, 2.28, 
    2.32, 2.24, 2.38), f100 = c(0.5, 0.39, 0.23, 0.89, 0.22, 
    0.95), f101 = c(0, 0, 0, 0, 0, 0), f102 = c(0.14, 0.07, 0.04, 
    0.21, 0.07, 0.19), f103 = c(0, 0, 0, 0, 0, 0), f104 = c(0.02, 
    0, 0, 0.04, 0.02, 0.03), f105 = c(0.1, 0.01, 0, 0.07, 0.02, 
    0.04), f106 = c(0.11, 0.03, 0, 0.07, 0.01, 0.05), f107 = c(0, 
    0, 0.01, 0.04, 0.01, 0.02), f108 = c(0.36, 0.03, 0.01, 0.14, 
    0.02, 0.1), f109 = c(0, 0, 0, 0, 0, 0), f110 = c(0.11, 0.03, 
    0.01, 0.11, 0.02, 0.07)), .Names = c("label", "f1", "f2", 
"f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10", "f11", "f12", 
"f13", "f14", "f15", "f16", "f17", "f18", "ft19", "f20", "f21", 
"f22", "f23", "f24", "f25", "f26", "f27", "f28", "f29", "f30", 
"f31", "f32", "f33", "f34", "f35", "f36", "f37", "f38", "f39", 
"f40", "f41", "f42", "f43", "f44", "f45", "f46", "f47", "f48", 
"f49", "f50", "f51", "f52", "f53", "f54", "ft55", "f56", "f57", 
"f58", "f59", "f60", "f61", "f62", "f63", "f64", "f65", "f66", 
"f67", "f68", "f69", "f70", "f71", "f72", "f73", "f74", "f75", 
"f76", "f77", "f78", "f79", "f80", "f81", "f82", "f83", "f84", 
"f85", "f86", "f87", "f88", "f89", "f90", "f91", "f92", "f93", 
"f94", "f95", "f96", "f97", "f98", "f99", "f100", "f101", "f102", 
"f103", "f104", "f105", "f106", "f107", "f108", "f109", "f110"
), class = "data.frame", row.names = c(NA, -6L))
Arun
  • 116,683
  • 26
  • 284
  • 387
Shahzad
  • 1,999
  • 6
  • 35
  • 44
  • Please use `dput` to paste the output here. What is this that you've pasted? – Arun Mar 05 '13 at 13:46
  • @Arun. Its a csv format my data set. – Shahzad Mar 05 '13 at 13:49
  • for the data you've posted: `label` is the only non-numeric column. So you could do: `df[-1][!colSums(df[-1] == 0) == nrow(df[-1])]` (check my answer). – Arun Mar 05 '13 at 14:01
  • @Arun. All are numeric except "label" column having value "A" or "B". But I have excluded some other non-numeric columns for the sake of simplicity. But still the goal is same to trim out columns where all values are 0 like f12. – Shahzad Mar 05 '13 at 14:04
  • 1
    yes, I understand that. And I've offered you a solution. I **really** don't know what is wrong with your problem anymore. What exactly are these answers lacking (or the one I've provided in mine and above in this comment)? – Arun Mar 05 '13 at 14:06
  • @Arun. I am testing your solution. – Shahzad Mar 05 '13 at 14:07
  • @Shahzad: You said your numeric columns are `1:99`, but in the data you `dput` the first column is a factor. In general you shouldn't use indices for this, since those can change, but test for `is.numeric` instead. Check my solution for an example. – adibender Mar 05 '13 at 15:27
  • @adibender. Please have a look at the comment in your answer. – Shahzad Mar 05 '13 at 16:33

6 Answers6

3
training_data[, !colSums(training_data == 0)]

Based on question update: (filter applied to columns 1 - 99)

idx <- which(as.logical(colSums(training_data[, 1:99] == 0))) # find columns
training_data[, setdiff(seq_along(test_data), idx)]           # exclude columns
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
  • I remember a comment from @hadley in a post as a reply to JDLong's answer about suggesting not to use `-which(.)`. I don't recall why.. and I am trying to get hold of that post. – Arun Mar 05 '13 at 12:33
  • 1
    @Arun I remember also , but here I am not even sure that solution is right, I test it on my data and I get an odd result. the Op don't give a reproducible example, so the question is confusing. – agstudy Mar 05 '13 at 12:37
  • 1
    @Arun compare these to see possible danger of negative subscripts: `c(2,3)[-which(1>c(2,3))]` compared to `c(2,3)[!(1>c(2,3))]`. Both try to extract values which are greater than one. – Jouni Helske Mar 05 '13 at 12:42
  • @Arun I think you refer to this: http://stackoverflow.com/questions/4605206/drop-columns-r-data-frame – Jack Ryan Mar 05 '13 at 12:42
  • @Hemmo, (+1) right on target. I'll have to write this down somewhere for another time! – Arun Mar 05 '13 at 12:57
  • @Arun Thanks for the comment. I wasn't aware of the problem of `numeric(0)`. I modified the answer accordingly. – Sven Hohenstein Mar 05 '13 at 13:36
2

You can use colSums

dat <- diag(10)
dat[1,1]  <- 0
dat[5,5]  <- 0

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    0    0    0     0
 [2,]    0    1    0    0    0    0    0    0    0     0
 [3,]    0    0    1    0    0    0    0    0    0     0
 [4,]    0    0    0    1    0    0    0    0    0     0
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    0    0    0    0    0    1    0    0    0     0
 [7,]    0    0    0    0    0    0    1    0    0     0
 [8,]    0    0    0    0    0    0    0    1    0     0
 [9,]    0    0    0    0    0    0    0    0    1     0
[10,]    0    0    0    0    0    0    0    0    0     1

colSums(dat) == 0
 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE

So to remove the columns with 0 , you just do this

dat[  ,colSums(dat)!=0]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 [1,]    0    0    0    0    0    0    0    0
 [2,]    1    0    0    0    0    0    0    0
 [3,]    0    1    0    0    0    0    0    0
 [4,]    0    0    1    0    0    0    0    0
 [5,]    0    0    0    0    0    0    0    0
 [6,]    0    0    0    1    0    0    0    0
 [7,]    0    0    0    0    1    0    0    0
 [8,]    0    0    0    0    0    1    0    0
 [9,]    0    0    0    0    0    0    1    0
[10,]    0    0    0    0    0    0    0    1

EDIT

This assume that all data have the same sign, to avoid this ,

dat[  ,colSums(abs(dat[,1:99]))!=0]
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • 3
    This only works if there is no negative values in the data, otherwise it can happen that the column sum is zero even though not all values in the column are zero. – Jouni Helske Mar 05 '13 at 12:12
  • I am sorry. I have to get back my words. The solution is still not working. – Shahzad Mar 05 '13 at 13:17
  • @Shahzad you don't tell why and how? can you add a reproducible example to your question and the expected result. – agstudy Mar 05 '13 at 13:21
  • @agstudy I have posted some part of my data. Please have a look. Thanks. – Shahzad Mar 05 '13 at 13:45
  • @agstudy: I think it doesn't work, because you build the logical index on `dat[, 1:99]`, but subset on `dat[, ...]`, not `dat[, 1:99][, ...]` – adibender Mar 05 '13 at 15:07
2
training_data[,apply(training_data, MARGIN = 2, FUN = function(x) !all(x == 0))]
Pavel Obraztcov
  • 338
  • 1
  • 6
2

Just another way using lapply as it is a data.frame. apply internally converts data.frame to a matrix I believe.

df[!unlist(lapply(df, function(x) all(x==0)))]

Or in your case:

df[, 1:99][!unlist(lapply(df[, 1:99], function(x) all(x==0)))]

Edit: Another way using colSums. The trick is to use it after checking for 0.

df[!colSums(df == 0) == nrow(df)]

If you know which columns are numeric (say, 1:99), then replace df with:

df[,1:99][!colSums(df[,1:99] == 0) == nrow(df)]
Arun
  • 116,683
  • 26
  • 284
  • 387
2

I think in the solutions using all(x == 0) it is slightly more efficient to use any(x!=0), because any stops after the first instance of an element being !=0, which will be important with growing number of rows.

To provide a different solution using plyr and colwise (dat being the dputdata):

library(plyr)
f0 <- function(x) any(x!=0) & is.numeric(x)
colwise(identity, f0)(dat)

The idea is to go through every column in dat and return it (identity), but only if f0 returns TRUE, i.e. the column has at least one entry !=0 and the column is.numeric

EDIT: To do this for every data.frame in your list, eg. training_data <- list(dat, dat, dat, dat)

training_data_clean <- lapply(training_data, function(z) colwise(identity, f0)(z))

sapply(training_data, dim)
     [,1] [,2] [,3] [,4]
[1,]    6    6    6    6
[2,]  111  111  111  111

sapply(training_data_clean, dim)
     [,1] [,2] [,3] [,4]
[1,]    6    6    6    6
[2,]   74   74   74   74

EDIT2: To retain the label column:

lapply(training_data, function(z) cbind(label = z$label, colwise(identity, f0)(z)))
adibender
  • 7,288
  • 3
  • 37
  • 41
  • Could you please provide a more thorough solution. I am not good at R. How to fit the above solution in trainingdata <- lapply(trainingdata, function(data) { } ). I have several data sets like these in the list trainingdata. Thanks. – Shahzad Mar 05 '13 at 15:49
  • The solution seems to work fine but how can I retain the first column i.e. "label" in trainingdata. – Shahzad Mar 05 '13 at 16:14
  • Great. Is it possible that somehow we can modify the code to put the condition that exclude those columns too which are all 1's (I mean all the values of the column are 1). – Shahzad Mar 05 '13 at 16:41
  • 1
    @Shahzad: this is straight forward by manipulation of `f0`, if you can't figure it out yourself, you should read a R tutorial. I think your original question has been answered sufficiently. You can ask a new question refering to this one, if new questions arise... – adibender Mar 05 '13 at 16:48
  • Thanks. I will post an other question. – Shahzad Mar 05 '13 at 16:50
  • @Shahzad: Ok, you should really figure out how to apply the solution to all elements `!=1` instead of `!=0` by yourself though ;-) – adibender Mar 05 '13 at 16:52
1
apply(df, 2, Filter, f = function(x){!all(x==0)})

I had the same question.

Community
  • 1
  • 1
Jack Ryan
  • 2,134
  • 18
  • 26
  • I think this is incomplete, because it returns a list with element numeric(0) when `all(x == 0)` instead of a data.frame containing only columns !all(x ==0) – adibender Mar 05 '13 at 15:37