Now I have a huge TrainDataFrame = pd.dataframe
containing approximately 200 million rows, and part of it is as below. user_id
is from 0 to 5049, and video_id
is from 0 to 62.
user_id video_id event_type time
516194 0 10 stop_video 2015-04-12 21:11:12.846636
516195 0 10 pause_video 2015-04-12 21:11:13.175032
516478 0 10 play_video 2015-04-12 21:04:07.912241
496935 0 15 play_video 2015-04-12 21:22:12.946172
497045 0 15 play_video 2015-04-12 21:21:05.246737
497295 0 15 pause_video 2015-04-12 21:21:38.627993
497061 0 19 stop_video 2015-04-12 21:20:45.325791
497062 0 19 pause_video 2015-04-12 21:20:45.662081
516211 0 19 play_video 2015-04-12 21:11:22.609964
... ... ... ... ...
262646 5049 53 play_video 2015-04-08 01:39:05.066670
262692 5049 53 play_video 2015-04-08 01:38:30.305655
262707 5049 53 play_video 2015-04-08 01:38:01.268150
1293111 5049 56 pause_video 2015-04-26 14:52:08.722080
1293112 5049 56 play_video 2015-04-26 14:51:49.871342
1293114 5049 56 play_video 2015-04-26 14:51:34.916506
1293115 5049 56 play_video 2015-04-26 14:51:33.729953
1293116 5049 56 play_video 2015-04-26 14:51:31.698727
What I want to do is to calculate how many times did each user operated on each video. For example, in the sample data, user_id 0
operated on video_id 10
for 3 times (stop_video
, pause_video
, play_video
), then the operation times for user_id 0
on video_id 10
is 3.
My idea to achieve this calculation is several simple iteration for
:
M = np.zeros([5050,63])
for a in range(5049):
for b in range(62):
if TrainDataFrame['user_id'] == a and TrainDataFrame['video_id'] == b:
M[a,b] = M[a,b] + 1
The codes above might have some problems since I am new to python, but basically it is what I am going to achieve. But since the data set is very huge, and this would cost quite a lot of time. I am wondering whether there is any better methods I can use to solve this issue. I have heard that TrainDataFrame.event_type.value_counts()
or TrainDataFrame.user_id.value_counts()
could process with a faster speed, but I couldn't generate the results I want since the two methods above would calculate all times without diving in users or video ids.