
Now I have a huge pandas DataFrame, TrainDataFrame, containing approximately 200 million rows; part of it is shown below. user_id ranges from 0 to 5049, and video_id ranges from 0 to 62.

         user_id  video_id   event_type                        time
516194         0        10   stop_video  2015-04-12 21:11:12.846636
516195         0        10  pause_video  2015-04-12 21:11:13.175032
516478         0        10   play_video  2015-04-12 21:04:07.912241
496935         0        15   play_video  2015-04-12 21:22:12.946172
497045         0        15   play_video  2015-04-12 21:21:05.246737
497295         0        15  pause_video  2015-04-12 21:21:38.627993
497061         0        19   stop_video  2015-04-12 21:20:45.325791
497062         0        19  pause_video  2015-04-12 21:20:45.662081
516211         0        19   play_video  2015-04-12 21:11:22.609964
...          ...       ...          ...                         ...
262646      5049        53   play_video  2015-04-08 01:39:05.066670
262692      5049        53   play_video  2015-04-08 01:38:30.305655
262707      5049        53   play_video  2015-04-08 01:38:01.268150
1293111     5049        56  pause_video  2015-04-26 14:52:08.722080
1293112     5049        56   play_video  2015-04-26 14:51:49.871342
1293114     5049        56   play_video  2015-04-26 14:51:34.916506
1293115     5049        56   play_video  2015-04-26 14:51:33.729953
1293116     5049        56   play_video  2015-04-26 14:51:31.698727

What I want to do is calculate how many times each user operated on each video. For example, in the sample data, user_id 0 operated on video_id 10 three times (stop_video, pause_video, play_video), so the operation count for user_id 0 on video_id 10 is 3.

My idea to achieve this calculation is a couple of simple nested for loops:

import numpy as np

M = np.zeros([5050, 63])
for a in range(5050):
    for b in range(63):
        # count the rows matching this (user, video) pair
        M[a, b] = ((TrainDataFrame['user_id'] == a) &
                   (TrainDataFrame['video_id'] == b)).sum()

The code above might have some problems since I am new to Python, but it is basically what I am trying to achieve. Since the data set is very large, this would take quite a lot of time, so I am wondering whether there is a better method. I have heard that TrainDataFrame.event_type.value_counts() or TrainDataFrame.user_id.value_counts() run much faster, but I couldn't generate the result I want, since those two methods count all occurrences of a single column without splitting by both user and video id.
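To make the desired result concrete, here is a tiny reproducible example in the same shape as my data, together with one approach I have been experimenting with: grouping by both columns at once and pivoting the counts into a users-by-videos matrix (the small frame and its values below are made up for illustration; I am not sure this is the idiomatic way):

```python
import pandas as pd

# Tiny synthetic sample with the same columns as TrainDataFrame
# (the time column is omitted, since it does not affect the counts).
df = pd.DataFrame({
    'user_id':    [0, 0, 0, 0, 1, 1],
    'video_id':   [10, 10, 10, 15, 10, 56],
    'event_type': ['stop_video', 'pause_video', 'play_video',
                   'play_video', 'play_video', 'pause_video'],
})

# Count rows per (user_id, video_id) pair in a single pass,
# then pivot into a matrix: rows = user_id, columns = video_id.
counts = df.groupby(['user_id', 'video_id']).size()
M = counts.unstack(fill_value=0)

print(M.loc[0, 10])  # user 0 operated on video 10 three times
```

If this is on the right track, I assume the same two lines would apply to the full TrainDataFrame, possibly followed by a reindex to force the full 5050 x 63 shape for users or videos that never appear.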

Vinnton

0 Answers