3

I had to clarify my question because some think that is unclar question. The question is, i have a list of four different types of data. Here is a short part of it to clearify my idea.

   UserID     movie_id  rating    unix_timestamp
    196        242         3       881250949
    186        302         3       891717742
    22         377         1       878887116
    244        51          2       880606923
    166        346         1       886397596
    298        474         4       884182806
    115        265         2       881171488
    253        465         5       891628467
    305        451         3       886324817
     7         451         5       891353892

Some monvies, the sum of ratings that they got from the users were 50 , 44 or 88 etc. For instanace, movie_id (451) got 3 and 5 ratings (so it got 8 ratings together ). I wanted to exclude those movies which got less than 50 ratings. and get the average of the other movies with more than 50 ratings(the sum of ratings that they got from the users) and show only the top 5 or 10 values.

Here is part pf the code

grouped_data = ratings['rating'].groupby(ratings['movie_id'])
## average and combine
average_ratings = grouped_data.mean()
print ("Average ratings:")
print (average_ratings.head())
  • Possible duplicate of [pandas: what is the equivalent of SQL group by having?](http://stackoverflow.com/questions/22105452/pandas-what-is-the-equivalent-of-sql-group-by-having) – Itay Nov 07 '16 at 13:25
  • I don't understand your question... are you asking how to limit the result to the first five? – Bakuriu Nov 07 '16 at 13:28
  • Can you show an example `ratings` with the desired output? – IanS Nov 07 '16 at 15:41

1 Answers1

2

Many ways to skin a cat as often in pandas, here are a couple:

1.Apply several functions to the groupby

Apply both mean and count to the groupby:

In [1]: df= ratings['rating'].groupby(ratings['movie_id']).agg(['mean', 'count'])
        df.head(3)
Out[1]: 
           mean     count
movie_id        
1          3.878319     452
2          3.206107     131
3          3.033333     90

Then you can filter it and return the 5 largest:

In [2]: df.ix[(df['count'] >= 50), 'mean'].nlargest(5)

Out[2]:
movie_id
408    4.491071
318    4.466443
169    4.466102
483    4.456790
114    4.447761
Name: mean, dtype: float64

2.Use boolean indexing after the fact

This assumes you have executed the entire code of your question, thus average_ratings is already existing

movie_count = ratings.movie_id.value_counts()
higher_than_50_votes = movie_count.index[movie_count > 50]
# Apply that to your average_ratings, sort, and return
average_ratings.ix[higher_than_50_votes].sort_values(ascending=False).head(5)

3. Using groupby.filter

ratings.groupby('movie_id').filter(lambda x: len(x) > 50).groupby('movie_id')['rating'].mean().sort_values(ascending=False).head(5)
Julien Marrec
  • 11,605
  • 4
  • 46
  • 63
  • 3
    use nlargest much more efficient that sorting and head (can also do this directly on groupby) – Jeff Nov 07 '16 at 14:10
  • Ha! I completely missed this one even though I've been using pandas for years now :) You learn something everyday, thanks! – Julien Marrec Nov 07 '16 at 14:13
  • What if i want to use (Mr Julien Marrec solution ) but exclude some users´ rating, the user id refers to users with different names so is it possible to exclude some names from the count. –  Nov 07 '16 at 22:11
  • Filter out on ratings before the groupby. Example to exclude 3 users. `exclude_users = [405, 655, 69]`. Then you can select ratings that aren't from these users. (the `~` means NOT) `ratings[~ratings.user_id.isin(exclude_users)]` Read the docs on [indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html), especially the "Indexing with isin" – Julien Marrec Nov 07 '16 at 22:25