1

Here is a snippet of my dataset:

 userId  movieId  rating            timestamp
  97809        1     3.0  2008-06-11 04:47:11
 106140        1     5.0  2013-01-29 03:33:49
 106138        1     3.0  2002-07-31 15:48:53
  70354        1     4.5  2011-02-13 18:55:40
  70355        1     3.5  2008-01-26 16:56:54
  70356        1     3.0  2012-11-01 16:34:45
  31554        1     4.0  1999-08-24 17:23:39
 117716        1     4.0  2001-03-28 07:20:04
  70358        1     3.0  2007-01-27 16:17:11
  70360        1     5.0  1997-03-16 20:52:42
  98815        1     5.0  2009-10-02 05:01:51
 106137        1     3.5  2006-06-03 11:32:48
  98816        1     4.0  1998-07-29 17:31:21
  18998        1     3.5  2010-07-10 23:28:11
  85495        1     4.0  2014-11-11 00:51:07
  40850        1     1.5  2003-10-05 02:11:50
  85494        1     5.0  2011-02-09 22:59:27
  31556        1     4.5  2011-12-18 05:51:59
  70366        1     3.0  1996-12-26 06:00:06
  12176        1     4.0  1997-07-13 20:12:56

There are several rows for each movieId having different ratings given by different userIds. I want to get an average rating for each movieId.

Here is the approach I tried:

rat_1 = pd.DataFrame()

for i in range(0,len(k)): # k is a list containing all the unique movieIds
    
    rat_2 = rating[rating['movieId']==k[i]] # Taking a subset of the original dataframe containing rows only of
                                            # the specified movieId 
    
    rat_2['rating']=sum(rat_2['rating'])/len(rat_2) # Calculating average rating
    

    
    rat_1 = pd.concat([rat_1,rat_2]) # Appending the subset dataframe to a new dataframe

However, the file is fairly big (about 660 MB) because of which the code is taking too long to execute. Is there a faster way to do this?
Thank you in advance!
P.S. This is the first time I'm posting a question here so I apologize if my doubt is not clear enough.

  • Does this answer your question? [pandas get column average/mean with round value](https://stackoverflow.com/questions/31037298/pandas-get-column-average-mean-with-round-value) – deadshot Jun 30 '20 at 04:46

3 Answers3

2

You should use groupby and mean.

df.groupby("movieId")['rating'].mean()
taha
  • 722
  • 7
  • 15
0

If you just want a rating only, @taha's answer is for you, but if you want to have a rating on a per-record basis, I think it's the following.

import pandas as pd
import numpy as np
import io

data = '''
id userid movieid rating timestamp
1 123 1 3.0 "2020-01-01 00:00:00"
2 121 1 4.0 "2020-01-01 00:00:00"
3 133 1 2.0 "2020-01-01 00:00:00"
4 144 2 1.0 "2020-01-01 00:00:00"
5 145 3 5.0 "2020-01-01 00:00:00"
6 167 3 3.5 "2020-01-01 00:00:00"
7 169 2 2.5 "2020-01-01 00:00:00"
8 254 1 4.5 "2020-01-01 00:00:00"
9 434 2 4.0 "2020-01-01 00:00:00"
10 534 3 3.5 "2020-01-01 00:00:00"
'''

df = pd.read_csv(io.StringIO(data), sep='\s+', index_col=0)

df['raiting_mean'] = df.groupby(['movieid'])['rating'].transform('mean')

df
    userid  movieid rating  timestamp   raiting_mean
id                  
1   123 1   3.0 2020-01-01 00:00:00 3.375
2   121 1   4.0 2020-01-01 00:00:00 3.375
3   133 1   2.0 2020-01-01 00:00:00 3.375
4   144 2   1.0 2020-01-01 00:00:00 2.500
5   145 3   5.0 2020-01-01 00:00:00 4.000
6   167 3   3.5 2020-01-01 00:00:00 4.000
7   169 2   2.5 2020-01-01 00:00:00 2.500
8   254 1   4.5 2020-01-01 00:00:00 3.375
9   434 2   4.0 2020-01-01 00:00:00 2.500
10  534 3   3.5 2020-01-01 00:00:00 4.000
r-beginners
  • 31,170
  • 3
  • 14
  • 32
0

Just want to clarify why lopping over all movies is slow. Basically a for loop in python is slow because it is just sugar code. So you should use group by and mean as @taha reply because those ops are already optimized.

Victor Luu
  • 244
  • 2
  • 10