I had categorical variables, which I converted to dummy variables, ending up with over 2381 variables. I won't need that many variables for analysis (say regression or correlation). I want to remove a column if over 90% of its values are '0'. Also, is there a good metric for removing columns other than 90% of the values being '0'? Help!
- Can you show us an example of your data? Read here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – RLave Dec 18 '18 at 08:07
- Here is a step-by-step guide on how to do that: https://stackoverflow.com/questions/53646975/subset-dataframe-based-on-number-of-observations-in-each-column/53647417#53647417 Instead of `colSums` you'd need `colMeans`. Start with `dat == 0`. – markus Dec 18 '18 at 08:11
- Please provide what you have tried already. – lsmor Dec 18 '18 at 08:17
3 Answers
2
This will give you a data.frame without the columns where more than 90% of the elements are 0:
df[sapply(df, function(x) mean(x == 0) <= 0.9)]
Or more elegantly, as markus suggests:
df[colMeans(df == 0) <= 0.9]
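
A quick demo of that one-liner on made-up dummy data (the column names here are invented for illustration). `colMeans(df == 0)` gives the fraction of zeros per column, and the logical subset keeps only columns at or below the 90% threshold:

```r
# Three toy dummy columns with different zero fractions
df <- data.frame(
  a = rep(0, 10),               # 100% zeros: removed
  b = c(rep(0, 9), 1),          # exactly 90% zeros: kept (not *over* 90%)
  c = c(rep(0, 5), rep(1, 5))   # 50% zeros: kept
)

# colMeans(df == 0) is c(1.0, 0.9, 0.5); keep columns with <= 90% zeros
kept <- df[colMeans(df == 0) <= 0.9]
names(kept)  # "b" "c"
```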

s_baldur
1
This is easily done with colSums:
Example data:
df <- data.frame(x = c(rep(0, 9), 1),
                 y = c(rep(0, 9), 1),
                 z = c(rep(0, 8), 1, 1))
> df
x y z
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 1 1 1
df[, colSums(df == 0)/nrow(df) < .9, drop = FALSE]
z
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 1
Regarding the question about a useful metric, this heavily depends on what you want to analyze. Even a column with more than 90% zero values may be useful for a regression model. I would look at the content of the variable, or use stepwise exclusion based on AIC or BIC to measure the relevance of your variables.
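
As a sketch of that AIC-based stepwise exclusion, base R's `step()` can drop predictors from a full model whenever removing them lowers the AIC. The response `y` and the dummy predictors below are invented for illustration:

```r
# Hypothetical data: only x1 actually drives the response
set.seed(1)
df <- data.frame(
  x1 = rbinom(100, 1, 0.5),
  x2 = rbinom(100, 1, 0.1),
  x3 = rbinom(100, 1, 0.05)
)
df$y <- 2 * df$x1 + rnorm(100)

full    <- lm(y ~ ., data = df)
# Backward elimination: drop terms while that improves (lowers) AIC
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

`step()` also supports `k = log(n)` to select by BIC instead of AIC.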

LAP
0
Hi,
I wrote some code with the dplyr package. Here is an example of how you can get rid of columns with more than 90% zeros in them:
library(dplyr)
df <- data.frame(colA = sample(c(0, 1), 100, replace = TRUE, prob = c(0.8, 0.2)),
                 colB = sample(c(0, 1), 100, replace = TRUE, prob = c(0.99, 0.01)),
                 colC = sample(c(0, 1), 100, replace = TRUE, prob = c(0.5, 0.5)),
                 colD = sample(c(0, 1), 100, replace = TRUE, prob = c(0, 1)),
                 colE = rep(0, 100))
fct <- function(x) x == 0
zero_count <- df %>% mutate_all(fct) %>% summarise_all(sum)  # zeros per column
col_filter <- zero_count <= 0.9 * nrow(df)                   # TRUE where <= 90% zeros
df_filter <- df[, col_filter, drop = FALSE]
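
On a recent dplyr (>= 1.0), the same filtering can be written more compactly with `select()` and `where()`; this is a sketch alongside the answer above, with invented column names:

```r
library(dplyr)

df <- data.frame(colA = rep(0, 10),                 # 100% zeros: removed
                 colB = c(rep(0, 5), rep(1, 5)))    # 50% zeros: kept

# Keep only columns where at most 90% of the values are zero
df_filter <- df %>% select(where(~ mean(.x == 0) <= 0.9))
names(df_filter)  # "colB"
```

`where()` takes a predicate applied to each column, so the threshold lives in one readable expression instead of a separate count-and-subset step.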

Freakazoid