
I need to calculate the percentage of zeros in each row of a data frame and discard the rows whose percentage is higher than a given threshold (60%). I figured I could add the values as a new variable with mutate(), but I don't know how to calculate them in the first place, since the number of columns is very large. Any suggestions?

anna
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 11 '20 at 15:55

3 Answers


We can write a small function that checks whether the share of zeros in a row exceeds a threshold, then use apply() to flag the rows to discard, all in base R:

## sampling data ##

set.seed(82)
df <- data.frame(a = sample(c(0,1,2,3), 10, replace = TRUE),
                 b = sample(c(0,1,2,3), 10, replace = TRUE),
                 c = sample(c(0,1,2,3), 10, replace = TRUE),
                 d = sample(c(0,1,2,3), 10, replace = TRUE),
                 e = sample(c(0,1,2,3), 10, replace = TRUE))

## function to find rows ##

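row.discard <- function(vec, thresh = 0.1){
  # TRUE when the share of zeros in the row exceeds thresh.
  # 0.1 is used for this demo; for a 60% cutoff call
  # apply(df, 1, row.discard, thresh = 0.6)
  sum(vec == 0) / length(vec) > thresh
}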

## apply to our df ##

ind <- apply(df, 1, row.discard)

## result ##

df[!ind,]

  a b c d e
1 3 2 2 3 2
5 2 1 1 2 1
6 1 2 3 3 3
7 1 3 3 1 2

Note: apply() converts the data frame to a matrix internally. Since we only use it to build a logical index and then subset the original df with that index, the conversion does not alter the rows we get back.
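As a vectorised alternative to the apply() call (a sketch, not part of the answer above, and assuming every column is numeric): `toy == 0` gives a logical matrix, and rowMeans() turns it into the share of zeros per row. The data frame `toy` and the variable names are made up for illustration; the 0.6 cutoff is the one from the question.

```r
# Hypothetical toy data: 4 rows, 5 numeric columns
toy <- data.frame(a = c(0, 1, 0, 2),
                  b = c(0, 0, 0, 3),
                  c = c(0, 1, 0, 0),
                  d = c(1, 0, 0, 1),
                  e = c(0, 2, 0, 1))

# Share of zeros in each row (TRUE counts as 1, FALSE as 0)
zero_share <- rowMeans(toy == 0)

# Keep only the rows where at most 60% of the values are zero
kept <- toy[zero_share <= 0.6, ]
```

Here rows 1 and 3 (80% and 100% zeros) are dropped, and rows 2 and 4 remain.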

fabla

With apply(), you can run a function on each row: count the zeros, divide by the number of columns, and multiply by 100 to get the percentage.

With the following example data (no seed is set, so your values will differ):

df <- data.frame(t(data.frame(Row1 = sample(c(1,0),20,replace = TRUE),
                              Row2 = sample(c(1,0),20,replace = TRUE))))

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
Row1  1  1  0  0  1  0  0  0  1   1   1   1   0   1   1   1   1   1   0   0
Row2  0  0  1  0  0  0  0  1  1   0   1   1   0   0   1   1   1   0   1   0

You can get the number of zeros in each row, and that count expressed as a percentage, with:

# Count of 0
apply(df,1, function(x) sum(x == 0))

Row1 Row2 
   8   11

# Count of 0 expressed as percentage
apply(df,1, function(x) sum(x == 0)/ncol(df)*100)

Row1 Row2 
  40   55 

Finally, to extract the rows whose percentage of zeros exceeds a given value (say 41%):

test <- apply(df,1, function(x) sum(x == 0)/ncol(df)*100)

df[test > 41,]
     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20
Row2  0  0  1  0  0  0  0  1  1   0   1   1   0   0   1   1   1   0   1   0

Does this answer your question?
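The question asks to discard, rather than extract, the rows above the threshold, so the comparison has to be flipped. A minimal sketch of that, using the same apply() idea: `drop_zero_heavy_rows` and `df_fixed` are hypothetical names, and the two rows are the ones printed above, typed in by hand so the example does not depend on sample().

```r
# Hypothetical helper: keep only rows whose percentage of zeros
# is at most max_pct (rows above the threshold are discarded)
drop_zero_heavy_rows <- function(df, max_pct = 60) {
  pct_zero <- apply(df, 1, function(x) sum(x == 0) / length(x) * 100)
  df[pct_zero <= max_pct, , drop = FALSE]
}

# The two example rows shown above, entered manually
df_fixed <- data.frame(rbind(
  Row1 = c(1,1,0,0,1,0,0,0,1,1,1,1,0,1,1,1,1,1,0,0),
  Row2 = c(0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,1,0)
))

kept_60 <- drop_zero_heavy_rows(df_fixed)               # 60% threshold
kept_41 <- drop_zero_heavy_rows(df_fixed, max_pct = 41) # 41% threshold
```

With the 60% threshold both rows are kept (40% and 55% zeros); with 41% only Row1 remains. `drop = FALSE` is there so the result stays a data frame even if the input has a single column.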

dc37
# Fraction of zeros in each row
percent0 <- apply(myDF, 1, function(x) sum(x == 0) / length(x))
# Keep the rows where at most 60% of the values are zero
myDF <- myDF[percent0 <= 0.6, ]
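One thing the lines above do not cover is missing values: if a row contains NA, sum(x == 0) is NA, and indexing a data frame with NA returns a row of NAs. A minimal sketch of one way around this, assuming NAs should simply be ignored when computing the share (that is a choice, not something stated in the question); the `myDF` below is made-up data.

```r
# Made-up data with one missing value in row 3
myDF <- data.frame(x1 = c(0, 1, 0),
                   x2 = c(0, 2, NA),
                   x3 = c(0, 0, 5),
                   x4 = c(3, 4, 6))

# Share of zeros per row among the non-missing values
percent0 <- apply(myDF, 1, function(x) mean(x == 0, na.rm = TRUE))

# Keep rows with at most 60% zeros
myDF <- myDF[percent0 <= 0.6, ]
```

Row 1 (75% zeros) is dropped; rows 2 and 3 are kept, with row 3 evaluated on its three non-missing values.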
Arnold Cross