Here is a snippet of my dataset:
userId movieId rating timestamp
97809 1 3.0 2008-06-11 04:47:11
106140 1 5.0 2013-01-29 03:33:49
106138 1 3.0 2002-07-31 15:48:53
70354 1 4.5 2011-02-13 18:55:40
70355 1 3.5 2008-01-26 16:56:54
70356 1 3.0 2012-11-01 16:34:45
31554 1 4.0 1999-08-24 17:23:39
117716 1 4.0 2001-03-28 07:20:04
70358 1 3.0 2007-01-27 16:17:11
70360 1 5.0 1997-03-16 20:52:42
98815 1 5.0 2009-10-02 05:01:51
106137 1 3.5 2006-06-03 11:32:48
98816 1 4.0 1998-07-29 17:31:21
18998 1 3.5 2010-07-10 23:28:11
85495 1 4.0 2014-11-11 00:51:07
40850 1 1.5 2003-10-05 02:11:50
85494 1 5.0 2011-02-09 22:59:27
31556 1 4.5 2011-12-18 05:51:59
70366 1 3.0 1996-12-26 06:00:06
12176 1 4.0 1997-07-13 20:12:56
There are several rows for each movieId, each with a rating given by a different userId. I want to compute the average rating for each movieId.
Here is the approach I tried:
rat_1 = pd.DataFrame()
for i in range(len(k)):  # k is a list containing all the unique movieIds
    # Subset of the original dataframe with only the rows for this movieId
    # (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
    rat_2 = rating[rating['movieId'] == k[i]].copy()
    # Replace each row's rating with the average rating for this movieId
    rat_2['rating'] = rat_2['rating'].sum() / len(rat_2)
    # Append the subset dataframe to the result
    rat_1 = pd.concat([rat_1, rat_2])
However, the file is fairly big (about 660 MB), and this loop is taking too long to execute.
Is there a faster way to do this?
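To make the desired output concrete, here is a minimal sketch with a tiny made-up DataFrame (same column names as my data). I believe a vectorized `groupby` does something like this, but I'm not sure it's the right/fastest option:

```python
import pandas as pd

# Tiny stand-in for the real 660 MB ratings file (made-up sample values)
rating = pd.DataFrame({
    'userId':  [97809, 106140, 106138, 70354],
    'movieId': [1, 1, 2, 2],
    'rating':  [3.0, 5.0, 3.0, 4.5],
})

# One row per movieId with its average rating
avg = rating.groupby('movieId', as_index=False)['rating'].mean()

# Or, to keep every row but overwrite 'rating' with the per-movie average
# (the same shape my loop produces)
rating['rating'] = rating.groupby('movieId')['rating'].transform('mean')
```

Is this the idiomatic way, or is there something even faster for a file this size?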
Thank you in advance!
P.S. This is the first time I'm posting a question here, so I apologize if it isn't clear enough.