
I am writing R code that selects the top 2 movies for each user, of which at most 1 may be a sponsored movie. The data is sorted by rating, as follows:

user   movie  rating  sponsored
10     m23    3.4     1
2      m5     3.3     0
6      m74    3.3     1
10     m3     3.2     0
6      m2     3.1     0
10     m54    3.0     1
6      m13    2.8     0
2      m74    2.6     1
2      m12    2.5     0

Since I have to sort by rating overall, not within each user, I was wondering how to keep track of per-user variables such as the number of movies selected for each user (K = 2) and the maximum number of sponsored movies for each user (S = 1). Should I create a separate table for each user with their 2 movies? And if so, how? The following is basically my algorithm:

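n: number of users
m: number of movies
K_u_i: number of movies already selected for the user in row i
S_u_i: number of sponsored movies already selected for that user

for(i in 1:(n*m)){    # rows are in descending order of rating
    if(K_u_i < 2 && (movie i is not sponsored || S_u_i < 1))
        add that movie to the top 2 list of that user
}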

Please let me know if any further clarification is needed.

Thank you

user2843669
  • Could you provide a sample output? – pooja p Jan 07 '19 at 19:46
  • The output would be a table with only top 10 user-movies for each user. – user2843669 Jan 07 '19 at 19:47
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. What is the exact output you want for the input above? No user even has 10 movies in the sample data, so it looks like it would just return the whole thing. – MrFlick Jan 07 '19 at 19:48
  • Yes, I am sorry, I just included a few rows of the data, but the data is about 1,000,000 rows with 2000 users. – user2843669 Jan 07 '19 at 19:51
  • Well, don't post your real data. Just post a minimal reproducible example that can be used to help you. We can assume that a solution that works on your sample data will work on your real data. – MrFlick Jan 07 '19 at 20:09
  • I have edited the question; I hope it is a bit clearer now. Thanks for the suggestions. – user2843669 Jan 07 '19 at 20:34

1 Answer


I'm not claiming that this is the only way, or a very elegant way, but it should work (it's hard to be sure without testing against a slightly larger dataset, though). Basic approach: first, split the data into sponsored and non-sponsored films. Cut the sponsored subset down to the top-rated film per user, so that no user can end up with more than one sponsored film. Append that back to the dataset of non-sponsored films. Now take the top 2 films for each user from the combined dataset.

library(dplyr)  # provides %>%, group_by(), slice(), ungroup()

dat <- data.frame(user = c(10, 2, 6, 10, 6, 10, 6, 2, 2),
                  movie = c('m23', 'm5', 'm74', 'm3', 'm2', 'm54', 'm13', 'm74', 'm12'),
                  rating = c(3.4, 3.3, 3.3, 3.2, 3.1, 3.0, 2.8, 2.6, 2.5),
                  sponsored = c(1, 0, 1, 0, 0, 1, 0, 1, 0))

# Split the data into sponsored and non-sponsored films
spons <- subset(dat, sponsored == 1)
non_spons <- subset(dat, sponsored == 0)

# Keep only the top-rated sponsored film for each user
spons <- spons[order(spons$user, spons$rating, decreasing = TRUE), ]
spons <- spons %>% group_by(user) %>% slice(1) %>%
  ungroup()

# Recombine, then keep the top 2 films for each user
new_dat <- rbind(spons, non_spons)
new_dat <- new_dat[order(new_dat$user, new_dat$rating, decreasing = TRUE), ]
new_dat <- new_dat %>% group_by(user) %>% slice(1:2) %>%
  ungroup() %>% print()
# A tibble: 6 x 4
   user movie rating sponsored
  <dbl> <fct>  <dbl>     <dbl>
1     2 m5       3.3         0
2     2 m74      2.6         1
3     6 m74      3.3         1
4     6 m2       3.1         0
5    10 m23      3.4         1
6    10 m3       3.2         0

Edit: the code I originally provided did not work, partly because I tried using dplyr without a lot of experience with that package. The version above is a hackier solution, but it works with the data provided.
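
For comparison, the single-pass idea from the question can also be written in base R. This is only a rough sketch, not a tested solution for the full dataset: `select_top` is a made-up helper name, `K` and `S` are the per-user limits from the question, and the column names are assumed to match the sample data. It walks the rows from highest to lowest rating and keeps two named counter vectors, one entry per user, instead of a separate table per user.

```r
dat <- data.frame(user = c(10, 2, 6, 10, 6, 10, 6, 2, 2),
                  movie = c('m23', 'm5', 'm74', 'm3', 'm2', 'm54', 'm13', 'm74', 'm12'),
                  rating = c(3.4, 3.3, 3.3, 3.2, 3.1, 3.0, 2.8, 2.6, 2.5),
                  sponsored = c(1, 0, 1, 0, 0, 1, 0, 1, 0))

# Hypothetical helper: greedy pass over rows sorted by rating (descending).
# K = max movies per user, S = max sponsored movies per user.
select_top <- function(dat, K = 2, S = 1) {
  dat <- dat[order(-dat$rating), ]           # best-rated rows first
  user <- as.character(dat$user)             # user ids used as vector names
  is_sp <- dat$sponsored == 1
  ids <- unique(user)
  picked <- setNames(integer(length(ids)), ids)   # movies kept so far, per user
  spons  <- setNames(integer(length(ids)), ids)   # sponsored movies kept, per user
  keep <- logical(nrow(dat))
  for (i in seq_len(nrow(dat))) {
    u <- user[i]
    # Keep the row if the user still has room and, for a sponsored movie,
    # the sponsored quota is not yet used up.
    if (picked[[u]] < K && (!is_sp[i] || spons[[u]] < S)) {
      keep[i] <- TRUE
      picked[[u]] <- picked[[u]] + 1L
      if (is_sp[i]) spons[[u]] <- spons[[u]] + 1L
    }
  }
  dat[keep, ]
}

res <- select_top(dat, K = 2, S = 1)
print(res)
```

On the sample data this should select the same six user-movie pairs as the output above, only ordered by rating rather than by user. An explicit loop over about 1,000,000 rows will likely be slower than the grouped approach, so treat it as an illustration of the bookkeeping rather than a recommendation.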

Joseph Clark McIntyre