Select rows in data frame that are mostly non-zero

Question

I am trying to extract the meaningful rows from a large data frame with 2k columns and 1m rows, in which many columns have a value equal to zero. A row is meaningful if most of its columns have a value other than zero.

As a toy example:

difference_from_mean <- data.frame('col1' = c(-1, 1.5, -1, 1.2, 1), 'col2' = c(1, -0.5, 0, -4, 0), 'col3' = c(0, 1, 0, 1, 0), 'col4' = c(0, 0, 2, 1, 0))

> difference_from_mean
  col1 col2 col3 col4
1 -1.0  1.0    0    0
2  1.5 -0.5    1    0
3 -1.0  0.0    0    2
4  1.2 -4.0    1    1
5  1.0  0.0    0    0

I would prefer to get this result:

> difference_from_mean_filtered
  col1 col2 col3 col4
1 -1.0  1.0    0    0
2  1.5 -0.5    1    0
3 -1.0  0.0    0    2
4  1.2 -4.0    1    1

I tried rowSums, but it did not work because the values can be negative, so the sum comes out zero or near zero for many rows. The above is just a toy example. I am looking to get the count of columns with a zero value into a new column, which would help in subsetting the df (I tried string matching, and it also did not work because it counts the 0s in the decimals).

Maybe you wanted `rowSums(difference_from_mean == 0)` to get the number of columns for each row that have a zero value. Not exactly sure what your definition of "most" columns here is. — MrFlick, Dec 18 '19 at 19:33
That's perfect. I looked at that in https://stackoverflow.com/questions/18862114/count-number-of-columns-by-a-condition-for-each-row but got confused by the explanation — shams, Dec 18 '19 at 19:36
Duplicate of [Count number of columns by a condition (>) for each row](https://stackoverflow.com/questions/18862114/count-number-of-columns-by-a-condition-for-each-row) — M--, Dec 18 '19 at 19:37
The question was closed, but since I worked on the solution I will post it here lol: `library(matrixStats); library(dplyr); difference_from_mean %>% filter(rowCounts(as.matrix(difference_from_mean), value = 0) / ncol(difference_from_mean) <= .5)` — Rafael Neves, Dec 18 '19 at 19:52
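
Pulling the comments above together, here is a minimal base R sketch of the zero-count approach on the toy data. The 0.5 cutoff and the names `zero_count`, `max_zero_fraction`, and `keep` are assumptions for illustration. The question never pins down "most", but a cutoff of at most half zeros reproduces the expected output shown in the question.

```r
# Toy data from the question
difference_from_mean <- data.frame(
  col1 = c(-1, 1.5, -1, 1.2, 1),
  col2 = c(1, -0.5, 0, -4, 0),
  col3 = c(0, 1, 0, 1, 0),
  col4 = c(0, 0, 2, 1, 0)
)

# `== 0` yields a logical matrix; rowSums() counts the TRUEs in each row,
# so negative values cannot cancel out positive ones
zero_count <- rowSums(difference_from_mean == 0)

# Assumed threshold: keep a row if at most half of its columns are zero
max_zero_fraction <- 0.5
keep <- zero_count / ncol(difference_from_mean) <= max_zero_fraction

# Subset the rows; drop = FALSE keeps the result a data frame
difference_from_mean_filtered <- difference_from_mean[keep, , drop = FALSE]

# Optionally store the count as a new column, as asked in the question
difference_from_mean$zero_count <- zero_count

print(difference_from_mean_filtered)
```

If the data may contain `NA`, `rowSums(..., na.rm = TRUE)` would probably be needed, since a single `NA` otherwise makes that row's count `NA`. On the full 1m x 2k data frame the intermediate logical matrix is as large as the data itself, so processing the rows in chunks may be worth considering.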
