How can I count row of different columns with a condition

Question

I need to count the number of rows that have ratings of 4 or greater than 4 for each movie(column). And then Divide it by the total count of ratings. How can this be done ? Look at the image below for a brief idea.

Table

The final result should be something like

0.7000000, 'The Shawshank Redemption'
0.5333333, 'Star Wars IV - A New Hope'
0.5000000, 'Gladiator'
0.4444444, 'Blade Runner'
0.4375000, 'The Silence of the Lambs'

Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. — Jaap, Feb 19 '17 at 17:23

Karthik Arumugham · Answer 1 · 2017-02-19T19:09:46.663

1

The data is not in the usual tidy format. df is your dataframe with some temp values.

library(dplyr)
df <- data_frame(user = letters[1:10],
           m1 = c(1,5,NA,NA,4,2,NA,4,5,4),
           m2 = c(5,3,NA,3,3,4,NA,NA,1,2),
           m3 = c(2,NA,NA,NA,4,4,3,NA,NA,NA))
df
# A tibble: 10 × 4
#    user    m1    m2    m3
#    <chr> <dbl> <dbl> <dbl>
#1      a     1     5     2
#2      b     5     3    NA
#3      c    NA    NA    NA
#4      d    NA     3    NA
#5      e     4     3     4
#6      f     2     4     4
#7      g    NA    NA     3
#8      h     4    NA    NA
#9      i     5     1    NA
#10     j     4     2    NA

Let's convert it to a key:value pair i.e. movie:rating, in this case.

library(tidyr)    
df <- gather(df, movie, rating, -user)
df
# A tibble: 30 × 3
#    user movie rating
#    <chr> <chr>  <dbl>
#1      a    m1      1
#2      b    m1      5
#3      c    m1     NA
#4      d    m1     NA
#5      e    m1      4
#6      f    m1      2
#7      g    m1     NA
#8      h    m1      4
#9      i    m1      5
#10     j    m1      4
# ... with 20 more rows

Now it becomes easy to summarise.

df %>% group_by(movie) %>% summarise(countp = mean(rating>=4, na.rm=T))
# A tibble: 3 × 2
#    movie    countp
#    <chr>    <dbl>
#1    m1 0.7142857
#2    m2 0.2857143
#3    m3 0.5000000

edited Feb 19 '17 at 19:09

answered Feb 19 '17 at 18:01

Karthik Arumugham

1,300
1
11
18

Hey, your solution is correct ! Thanks for the help. But there is small issue. When i use: summarise(Hello = sum(rating>=4, na.rm=TRUE)/n()) n() gives the count as 20 which the total number users. I need to divide it by the count of ratings for that specific movie. As some users have not rated for some movies. – vr9211 Feb 19 '17 at 18:28
@vr9211: see my second/Psidom's solution: use `mean(...,na.rm=T)` instead of `sum(...,na.rm=T)/n()`. PS: As this is your 1st question: Don't forget to upvote the answer you accepted (or any other answer you consider helpful). Welcome to the site! – mschilli Feb 19 '17 at 18:34
@vr9211 The `group_by(movie)` would take care of that. You can create new cols to verify `summarise(count = n())` – Karthik Arumugham Feb 19 '17 at 18:35
@KarthikArumugham: interesting...I prefer doing things without extra packages if possible (your code *e.g.* doesn't run for me as there seems to be no `gather` function in my `dplyr` package), but this grouping feature sure can be handy. – mschilli Feb 19 '17 at 18:40
@mschilli Thanks for pointing out. Its actually part of `tidyr` package. I missed it. – Karthik Arumugham Feb 19 '17 at 18:47
@KarthikArumugham: thx for correcting +1 for teaching my the the 'tidy way' ;) – mschilli Feb 19 '17 at 19:02
@vr9211 Use `mean()` instead of `sum()/n()`, since na.rm=T needs to work. – Karthik Arumugham Feb 19 '17 at 19:12

score 0 · Answer 2 · answered Feb 19 '17 at 17:39

You can use colMeans to calculate the percentage and stack the result to long format:

Example data frame:

df = data.frame(user = c("A", "B", "C", "D"), 
                movieA = c(4,2,NA,5), 
                movieB = c(1,1,NA,4))

stack(colMeans(df[-1] >= 4, na.rm = T))

#     values    ind
#1 0.6666667 movieA
#2 0.3333333 movieB

To see how this works:

df[-1] >= 4                       # returns a boolean matrix where ratings >= 4 gives TRUE

#     movieA movieB
#[1,]   TRUE  FALSE
#[2,]  FALSE  FALSE
#[3,]     NA     NA
#[4,]   TRUE   TRUE

And the average of a boolean vector is the percentage of TRUEs(with NA removed), so calculate the average for all columns with colMeans will give you the percentage you need.

mschilli · Accepted Answer · 2017-02-19T17:58:39.580

ratings<-data.frame(User=c("John","Maria","Anton","Roger","Martina","Ana","Sergi","Marc","Jim","Chris")
                   ,Star.Wars.IV...A.New.Hope=c(1,5,NA,NA,4,2,NA,4,5,4)
                   ,Star.Wars.VI...Return.of.the.Jedi=c(5,3,NA,3,3,4,NA,NA,1,2)
                   ,Forrest.Gump=c(2,NA,NA,NA,4,4,3,NA,NA,2)
                   )
ratings

      User Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi Forrest.Gump
1     John                         1                                 5            2
2    Maria                         5                                 3           NA
3    Anton                        NA                                NA           NA
4    Roger                        NA                                 3           NA
5  Martina                         4                                 3            4
6      Ana                         2                                 4            4
7    Sergi                        NA                                NA            3
8     Marc                         4                                NA           NA
9      Jim                         5                                 1           NA
10   Chris                         4                                 2            2

If you want to include the NAs in the total ratings count:

colSums(ratings[,-1]>=4,na.rm=T)/nrow(ratings)

        Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi                      Forrest.Gump
                              0.5                               0.2                               0.2

If you want to exclude the NAs from the total ratings count:

colMeans(ratings[,-1]>=4,na.rm=T)

        Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi                      Forrest.Gump
                     0.7142857143                      0.2857142857                      0.4000000000

How can I count row of different columns with a condition

3 Answers3