1

I had categorical variables, which I converted to dummy variables, ending up with over 2381 variables. I won't need that many variables for analysis (say regression or correlation). I want to remove a column if over 90% of the values in that column are '0'. Also, is there a good metric for removing columns other than 90% of values being '0'? Help!

  • Can you show us an example of your data? Read here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – RLave Dec 18 '18 at 08:07
  • Here is a step-by-step guide on how to do that: https://stackoverflow.com/questions/53646975/subset-dataframe-based-on-number-of-observations-in-each-column/53647417#53647417 Instead of `colSums` you'd need `colMeans`. Start with `dat == 0`. – markus Dec 18 '18 at 08:11
  • Please provide what you have tried already. – lsmor Dec 18 '18 at 08:17

3 Answers

2

This will give you a data.frame without the columns where more than 90% of the elements are 0:

df[sapply(df, function(x) mean(x == 0) <= 0.9)]

Or more elegantly, as markus suggests:

df[colMeans(df == 0) <= 0.9]
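For instance, on a small made-up data frame (column names here are illustrative, not from the question):

```r
df <- data.frame(sparse = c(rep(0, 19), 1),            # 95% zeros -> dropped
                 dense  = c(rep(0, 10), rep(1, 10)))   # 50% zeros -> kept

# colMeans(df == 0) is the fraction of zeros per column;
# a column with exactly 90% zeros survives <= 0.9 (use < 0.9 to drop it too)
df[colMeans(df == 0) <= 0.9]
```

Note the boundary: `<= 0.9` matches "remove if *over* 90% are zero" from the question.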
s_baldur
1

This is easily done with colSums:

Example data:

df <- data.frame(x = c(rep(0, 9), 1),
                 y = c(rep(0,9), 1),
                 z = c(rep(0, 8), 1, 1))

> df
   x y z
1  0 0 0
2  0 0 0
3  0 0 0
4  0 0 0
5  0 0 0
6  0 0 0
7  0 0 0
8  0 0 0
9  0 0 1
10 1 1 1

df[, colSums(df == 0)/nrow(df) < .9, drop = FALSE]
   z
1  0
2  0
3  0
4  0
5  0
6  0
7  0
8  0
9  1
10 1

Regarding the question about a useful metric: that depends heavily on what you want to analyze. Even a column with more than 90% zero values may be useful in a regression model. I would look at the content of each variable, or use stepwise exclusion based on AIC or BIC to measure the relevance of your variables.
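The stepwise idea can be sketched with base R's `step()`; the data below is simulated purely for illustration, with `x1` made genuinely relevant so the selection has something to find:

```r
set.seed(1)
# simulated response and predictors (names are illustrative)
dat <- data.frame(x1 = rnorm(100),
                  x2 = rnorm(100))
dat$y <- 2 * dat$x1 + rnorm(100)   # y depends on x1, not x2

full <- lm(y ~ ., data = dat)

# backward elimination by AIC; pass k = log(nrow(dat)) to use BIC instead
sel <- step(full, direction = "backward", trace = 0)
```

`sel` is then an `lm` fit containing only the predictors that survive the AIC comparison.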

LAP
0

Hi, I wrote some code with the dplyr package. Here is an example of how you can get rid of columns with more than 90% zeros:

library(dplyr)

df <- data.frame(colA = sample(c(0, 1), 100, replace = TRUE, prob = c(0.8, 0.2)),
                 colB = sample(c(0, 1), 100, replace = TRUE, prob = c(0.99, 0.01)),
                 colC = sample(c(0, 1), 100, replace = TRUE, prob = c(0.5, 0.5)),
                 colD = sample(c(0, 1), 100, replace = TRUE, prob = c(0, 1)),
                 colE = rep(0, 100))

# count the zeros in each column
zero_count <- df %>% summarise(across(everything(), ~ sum(.x == 0)))

# keep a column only if at most 90% of its values are zero
col_filter <- as.vector(zero_count <= 0.9 * nrow(df))

df_filter <- df[, col_filter, drop = FALSE]
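Because the data above is random, the dropped columns vary between runs. The same logic can be checked on fixed, illustrative data (`toy`, `a`, `b` are made-up names):

```r
library(dplyr)

toy <- data.frame(a = c(rep(0, 95), rep(1, 5)),    # 95% zeros -> dropped
                  b = c(rep(0, 50), rep(1, 50)))   # 50% zeros -> kept

zeros <- toy %>% summarise(across(everything(), ~ sum(.x == 0)))
toy_filter <- toy[, as.vector(zeros <= 0.9 * nrow(toy)), drop = FALSE]
names(toy_filter)   # only "b" remains
```

`drop = FALSE` keeps the result a data.frame even if a single column survives.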
Freakazoid