I have a question about NLP in R. My data set is very large, so I need to reduce it before applying an SVM to it.

I have a document-term matrix like this:

Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0

In this example I would like to drop the columns WordY and WordZZ, because they contain only zeros and so carry no information for this data frame. Is it possible to remove all columns that contain only zeros with a single command? My data frame is too large to delete each column individually: it has something like 4.0000.0000 columns.

Thank you in advance. Cheers, Tom

Rui Barradas
Sylababa

5 Answers

You could also use sapply:

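# Rebuild the example document-term matrix from the question
df <- read.table(text =
"Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0", header = TRUE)

# Keep only the columns that contain at least one non-zero value
df[, sapply(df, function(x) any(x != 0))]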

  Document WordZ WordV WordU
1        1     0     0     1
2        2     2     1     2
3        3     0     1     1
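
To make the selection step easier to inspect, here is a minimal sketch that stores the per-column test in a logical vector before subsetting. The names `keep` and `reduced` are illustrative and not part of the answer above; `vapply` is swapped in for `sapply` only because it guarantees a logical result, and `drop = FALSE` is an added safeguard so that the result stays a data frame even if only one column survives.

```r
# Example data from the question, rebuilt with data.frame()
df <- data.frame(Document = 1:3,
                 WordY    = 0,
                 WordZ    = c(0, 2, 0),
                 WordV    = c(0, 1, 1),
                 WordU    = c(1, 2, 1),
                 WordZZ   = 0)

# One TRUE/FALSE per column: TRUE if the column has any non-zero entry
keep <- vapply(df, function(x) any(x != 0), logical(1))

# Subset by the logical mask; drop = FALSE keeps a data frame in all cases
reduced <- df[, keep, drop = FALSE]
names(reduced)
```

Inspecting `keep` before subsetting also gives a quick count of how many columns would be removed, via `sum(!keep)`.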

Performance comparison:

Unit: microseconds
                                      expr      min        lq      mean    median        uq      max neval
 df[, sapply(df, function(x) any(x != 0))]  156.401  190.9515  236.3650  225.5510  271.0005  371.201   100
                df[, colSums(abs(df)) > 0]  345.601  398.6005  555.2809  451.8010  506.8005 6005.601   100
        dplyr::select_if(df, ~any(. != 0)) 2282.301 2620.9015 2939.9239 2773.1510 3019.9005 6588.402   100
 df[, `:=`(which(colSums(df) == 0), NULL)]  223.201  262.4015  337.5781  297.9015  352.2020 2528.900   100
Waldi

Using colSums():

df[, colSums(abs(df)) > 0]

That is, a column contains only zeros if and only if the sum of its absolute values is zero.
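
The role of `abs()` can be seen in a small sketch with made-up data (the data frame `x` and its columns `a`, `b`, `c` are hypothetical, chosen only for illustration). Word counts in a document-term matrix are non-negative, so the plain sum would probably be enough there; the absolute value matters when a column can hold negative values that cancel out.

```r
# Hypothetical data: column a sums to zero without being all zeros
x <- data.frame(a = c(-1, 1, 0),
                b = c(0, 0, 0),
                c = c(2, 0, 1))

# Plain sums: a is wrongly flagged as empty because -1 and 1 cancel
plain_mask <- colSums(x) != 0

# Sums of absolute values: only the truly all-zero column b is flagged
abs_mask <- colSums(abs(x)) > 0

# drop = FALSE keeps a data frame even if a single column remains
kept <- x[, abs_mask, drop = FALSE]
names(kept)
```

Note that a column containing `NA` yields an `NA` sum, so this sketch assumes the data has no missing values.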

VitaminB16

Here is how I would do it:

dplyr::select_if(YOUR_DATA, ~ any(. != 0))

Returns:

  Document WordZ WordV WordU
1        1     0     0     1
2        2     2     1     2
3        3     0     1     1
ktiu

Another tidyverse solution. select_if is superseded by the following usage of select and where.

library(tidyverse)

dat2 <- dat %>%
  select(where(~any(. != 0)))
dat2
#   Document WordZ WordV WordU
# 1        1     0     0     1
# 2        2     2     1     2
# 3        3     0     1     1

Data

dat <- read.table(text = "Document WordY WordZ WordV WordU WordZZ
1        0     0     0     1     0
2        0     2     1     2     0
3        0     0     1     1     0",
                  header = TRUE)
www

This question is a simpler version of this other SO question. Here is code inspired by the accepted answer.

df1[, which(colSums(df1) == 0) := NULL]

Data creation code

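library(data.table)
set.seed(2021)
# 10 x 5 matrix of random 0/1 values, converted to a data.table
df1 <- replicate(5, rbinom(10, 1, 0.5))
df1 <- as.data.table(df1)
# Force the third column to be all zeros
df1[, 3] <- 0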
Rui Barradas