Remove all columns or rows with only zeros out of a data frame

Question

I have a question about NLP in R. My data set is very large, so I need to reduce it before I can apply an SVM to it.

I have a Document-Term-Matrix like this:

Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0

In this example I would like to drop the columns WordY and WordZZ, because these columns carry no information for this data frame. Is it possible to remove all columns that contain only zeros with a single command? My problem is that the data frame is too large to delete each column individually. It has something like 4.0000.0000 columns.

Thank you in advance, guys. Cheers, Tom

Why groups of 4 zeros in `4.0000.0000`? – Rui Barradas Jun 06 '21 at 17:22

Waldi · Answer 1 · 2021-06-06T18:13:28.753

You could also use sapply:

df <- read.table(text=
"Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0", header = TRUE)


df[,sapply(df,function(x) any(x!=0))]

  Document WordZ WordV WordU
1        1     0     0     1
2        2     2     1     2
3        3     0     1     1
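
One caveat with the `df[, <logical>]` form: if only a single column survives the filter, base R simplifies the result to a plain vector. A minimal sketch of a wrapper that avoids this by passing `drop = FALSE`. The helper name `drop_zero_cols` is hypothetical and not part of the answer above, and ignoring `NA` values (so that an all-`NA` column is dropped too) is an assumption that may not suit every data set.

```r
# Hypothetical helper (name is illustrative): keep only columns that contain
# at least one non-zero value, and always return a data frame.
# Assumption: NA values are ignored, so an all-NA column is dropped as well.
drop_zero_cols <- function(df) {
  keep <- vapply(df, function(x) any(x != 0, na.rm = TRUE), logical(1))
  df[, keep, drop = FALSE]
}

df <- read.table(text =
"Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0", header = TRUE)

res <- drop_zero_cols(df)
names(res)   # "Document" "WordZ" "WordV" "WordU"

# Even when a single column survives, the result stays a data frame
one <- drop_zero_cols(df[, c("WordY", "WordU")])
class(one)   # "data.frame"
```

`vapply()` is used here instead of `sapply()` only because it guarantees a logical vector even for a data frame with zero columns.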

Performance comparison:

Unit: microseconds
                                      expr      min        lq      mean    median        uq      max neval
 df[, sapply(df, function(x) any(x != 0))]  156.401  190.9515  236.3650  225.5510  271.0005  371.201   100
                df[, colSums(abs(df)) > 0]  345.601  398.6005  555.2809  451.8010  506.8005 6005.601   100
        dplyr::select_if(df, ~any(. != 0)) 2282.301 2620.9015 2939.9239 2773.1510 3019.9005 6588.402   100
 df[, `:=`(which(colSums(df) == 0), NULL)]  223.201  262.4015  337.5781  297.9015  352.2020 2528.900   100

VitaminB16 · Accepted Answer · 2021-06-08T21:34:06.650

score 3

Using colSums():

df[, colSums(abs(df)) > 0]

i.e. a column has only zeros if and only if the sum of the absolute values is zero.
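
To illustrate why the absolute value matters, here is a small sketch on made-up data (the data frame `x` and its column names are assumptions for illustration only). A column holding `1` and `-1` sums to zero without being all zeros. Since the title also asks about rows, the same idea is shown with `rowSums()`.

```r
# Made-up example data: column b cancels out to zero, column c is truly
# all zeros, and row 3 is all zeros.
x <- data.frame(a = c(1, 2, 0),
                b = c(1, -1, 0),
                c = c(0, 0, 0))

# Without abs(): b is wrongly treated as an all-zero column
names(x)[colSums(x) != 0]   # "a"

# With abs(): only the genuinely all-zero column c is removed
cols <- x[, colSums(abs(x)) > 0, drop = FALSE]
names(cols)                 # "a" "b"

# Row analogue: keep rows with at least one non-zero entry
rows <- x[rowSums(abs(x)) > 0, , drop = FALSE]
nrow(rows)                  # 2
```

For a document-term matrix of counts, values are never negative, so `abs()` makes no difference there; it only guards against the general case.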

edited Jun 08 '21 at 21:34

answered Jun 06 '21 at 17:43

If a column only has zeros then why the absolute? – Rui Barradas Jun 06 '21 at 18:11
It might have `1, -1`, then the sum will still be zero – VitaminB16 Jun 06 '21 at 18:12

score 2 · Answer 3 · answered Jun 06 '21 at 17:25

Here is how I would do it:

dplyr::select_if(YOUR_DATA, ~ any(. != 0))

Returns:

  Document WordZ WordV WordU
1        1     0     0     1
2        2     2     1     2
3        3     0     1     1

ktiu

score 1 · Answer 4 · answered Jun 06 '21 at 18:17

Another tidyverse solution. select_if is superseded by the following usage of select and where.

library(tidyverse)

dat2 <- dat %>%
  select(where(~any(. != 0)))
dat2
#   Document WordZ WordV WordU
# 1        1     0     0     1
# 2        2     2     1     2
# 3        3     0     1     1

Data

dat <- read.table(text = "Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0",
                  header = TRUE)

Rui Barradas · Answer 5 · 2021-06-06T18:07:22.500

score 0

This question is a simpler version of this other SO question. Here is code inspired by the accepted answer there.

df1[, which(colSums(df1) == 0) := NULL]

Data creation code

library(data.table)  # needed for as.data.table() and `:=`

set.seed(2021)
df1 <- replicate(5, rbinom(10, 1, 0.5))
df1 <- as.data.table(df1)
df1[, 3] <- 0

edited Jun 06 '21 at 18:07

answered Jun 06 '21 at 17:40

@Waldi Yes, to assign `NULL` to the column numbers in the LHS. – Rui Barradas Jun 06 '21 at 18:09
OK, just saw you modified tags – Waldi Jun 06 '21 at 18:11
@Waldi The tags nlp and e1071 have nothing to do with the *question*. They might have to do with *part of the problem* the OP is trying to solve but they are not needed for this question. – Rui Barradas Jun 06 '21 at 18:15
Thanks for clarification, I agree. I meant that you added `data.table` which is also OK. – Waldi Jun 06 '21 at 18:21
