
I have a CSV file with the columns user_id, Movie1, Movie2, Movie3, ..., Movie250. Please see the image of the CSV file for reference. Each user gave different ratings to the movies, and the data contains many NA values. We can't omit the NA values, or we would lose a lot of valuable data. At the same time, the NA values prevent the average rating from being calculated correctly.

I need to answer the following questions:

  • Which movies have maximum views/ratings?
  • What is the average rating for each movie?
  • Find the top 5 movies with the maximum ratings.
  • Find the top 5 movies with the least audience.
sdp
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data (not a picture of it) and all necessary code. Also keep in mind the *minimal* part of [mcve]. This seems like a homework problem; have you tried any of it yet? – camille Mar 13 '20 at 18:17

1 Answer


We can use `pmax` to get the maximum rating in each row, i.e. the highest rating each user gave:

do.call(pmax, c(df1[-1], na.rm = TRUE))

Or, to find the movie that each user rated highest:

apply(df1[-1], 1, function(x) names(x)[which.max(x)])
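
As a small illustration of the two calls above, here is a sketch on a made-up data frame. The `df1` below and its ratings are hypothetical stand-ins for the asker's data, which was only posted as an image.

```r
# Hypothetical toy data: 3 users, 3 movies, with some missing ratings
df1 <- data.frame(
  user_id = 1:3,
  Movie1  = c(5, NA, 2),
  Movie2  = c(3, 4, NA),
  Movie3  = c(NA, 1, 2)
)

# Highest rating given by each user (row-wise maximum, ignoring NA)
row_max <- do.call(pmax, c(df1[-1], na.rm = TRUE))
print(row_max)

# Name of the movie each user rated highest.
# which.max() skips NA and returns the first position on ties.
top_movie <- apply(df1[-1], 1, function(x) names(x)[which.max(x)])
print(top_movie)
```

Both results have one entry per user, not per movie, so they describe each user's favourite rather than the most-rated movie overall.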

The average rating for each movie can be found with `colMeans`; `na.rm = TRUE` makes it skip the NA values instead of returning NA:

colMeans(df1[-1], na.rm = TRUE)
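
The question also asks which movies have the most views/ratings, which `colMeans` alone does not answer. One possible approach, assuming a non-NA cell means the user rated that movie, is to count the non-missing values per column. The sketch below uses a hypothetical toy `df1` and a hypothetical output path `out_file`; it also writes the summary to a CSV file with `write.csv`.

```r
# Hypothetical toy data: 4 users, 3 movies, with missing ratings
df1 <- data.frame(
  user_id = 1:4,
  Movie1  = c(5, NA, 2, 4),
  Movie2  = c(3, 4, NA, NA),
  Movie3  = c(NA, NA, NA, 1)
)
ratings <- df1[-1]  # drop the user_id column

# Number of users who rated each movie (assumes non-NA means "rated")
n_ratings <- colSums(!is.na(ratings))

# Average rating per movie, ignoring missing values
avg_rating <- colMeans(ratings, na.rm = TRUE)

# Combine both into one table, one row per movie
summary_df <- data.frame(
  movie      = names(avg_rating),
  n_ratings  = n_ratings,
  avg_rating = avg_rating,
  row.names  = NULL
)
print(summary_df)

# Save the table; the file name and location are placeholders
out_file <- file.path(tempdir(), "movie_summary.csv")
write.csv(summary_df, out_file, row.names = FALSE)
```

Keeping the count next to the mean makes it easier to spot averages that rest on only one or two ratings.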

To get the top 5 movies by their maximum rating, one option is to summarise each movie column and convert the result to 'long' format:

library(dplyr)
library(tidyr)
df1 %>%
    summarise_at(vars(starts_with('Movie')), max, na.rm = TRUE) %>%
    pivot_longer(everything(), values_drop_na = TRUE) %>%
    top_n(5, wt = value)
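
The pipeline above ranks movies by their single highest rating. If "top 5" is instead read as the highest average rating, and "least audience" as the fewest ratings, a base R sketch could look like the following. The toy `df1` and both interpretations are assumptions, since the question does not say which ranking is intended.

```r
# Hypothetical toy data: 4 users, 6 movies, with missing ratings
df1 <- data.frame(
  user_id = 1:4,
  Movie1  = c(5, 4, NA, 3),
  Movie2  = c(NA, NA, 2, NA),
  Movie3  = c(4, 5, 5, 4),
  Movie4  = c(1, NA, NA, 2),
  Movie5  = c(NA, 3, 3, 3),
  Movie6  = c(1, NA, NA, NA)
)
ratings <- df1[-1]  # drop the user_id column

avg_rating <- colMeans(ratings, na.rm = TRUE)  # mean rating per movie
n_ratings  <- colSums(!is.na(ratings))         # number of ratings per movie

# Top 5 movies by average rating
top5_avg <- head(sort(avg_rating, decreasing = TRUE), 5)
print(top5_avg)

# Top 5 movies by number of ratings (largest audience)
top5_most_rated <- head(sort(n_ratings, decreasing = TRUE), 5)
print(top5_most_rated)

# 5 movies with the fewest ratings (smallest audience)
top5_least_rated <- head(sort(n_ratings), 5)
print(top5_least_rated)
```

Movies with equal counts or averages come out in whatever order `sort` leaves them, so an explicit tie-breaking rule may be needed on the real data.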
akrun
  • Error in pivot_longer(., everything(), values_drop_na = TRUE) : could not find function "pivot_longer" – sdp Mar 13 '20 at 18:08
  • @sdp It is from `tidyr`. My `packageVersion('tidyr')` is ‘1.0.0’, and `pivot_longer` was added in that version. Otherwise, use `gather(na.rm = TRUE)` – akrun Mar 13 '20 at 18:09
  • Is this correct? df %>% summarise_at(vars(starts_with('Movie')), max, na.rm = TRUE) %>% gather(na.rm = TRUE) %>% top_n(value, 5) – sdp Mar 13 '20 at 18:24
  • @sdp What is the error message? Yes, it is correct. Do you get the same message, `could not find function pivot_longer`? If that is the case, check `packageVersion('tidyr')` – akrun Mar 13 '20 at 18:24
  • @sdp What is your `packageVersion('tidyr')`? Have you loaded `library(tidyr)`? If the version is below 1.0.0, it won't work because the function was added recently – akrun Mar 13 '20 at 18:27
  • @sdp Then it shouldn't show the error that `pivot_longer` was not found. Can you try in a fresh R session? – akrun Mar 13 '20 at 18:37
  • @sdp Do you get the warning with `df1 %>% summarise_at(vars(starts_with('Movie')), max, na.rm = TRUE) %>% pivot_longer(everything(), values_drop_na = TRUE)`? – akrun Mar 13 '20 at 18:48
  • @sdp I think the call to `top_n` should be `top_n(5, wt = value)`. Sorry, I couldn't test it without a reproducible example – akrun Mar 13 '20 at 18:49
  • @sdp Are you getting the warning with this step, `df1 %>% summarise_at(vars(starts_with('Movie')), max, na.rm = TRUE) %>% pivot_longer(everything(), values_drop_na = TRUE)`, or when adding the `top_n`? – akrun Mar 13 '20 at 18:51
  • `do.call(pmax, c(df[-1], na.rm = TRUE))` and `apply(df[-1], 1, function(x) names(x)[which.max(x)])` are not working for question 1. Is there any way to store the output of `colMeans(df[-1], na.rm = TRUE)` in a CSV file? – sdp Mar 13 '20 at 18:56
  • @sdp The `apply` with `MARGIN = 1` loops over the rows and finds the column name where the value is the maximum for each row – akrun Mar 13 '20 at 19:01
  • @sdp Suppose you have the data `df1 <- data.frame(cats = 1:3, dogs = c(1, 5, 3), horse = c(2, 3, 4))`. What would be the expected output? – akrun Mar 13 '20 at 19:06