I have a very large CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.
My problem comes when I try to load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on the user_id values to make my calculation (I don't want my score to be too trivial), I don't know how to split my data smartly: if I only use chunksize, a user's rows may be spread across several chunks, so I won't be able to use groupby properly.
Put simply, I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give the user 1 point if the click happened one month ago, 2 points if it happened two weeks ago, and 4 points if it happened last week.
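To make the rule concrete, here is a minimal sketch of the calculation I am after, run on a tiny in-memory sample rather than the real file. The exact cut-offs (7, 14, and 30 days), the reference date, the sample rows, and the helper names click_points and score_users are all assumptions made for illustration only.

```python
import pandas as pd


def click_points(age_days):
    """Map a click's age in days to points (cut-offs are assumed, not fixed)."""
    if age_days <= 7:
        return 4   # happened within the last week
    if age_days <= 14:
        return 2   # happened within the last two weeks
    if age_days <= 30:
        return 1   # happened within the last month
    return 0       # older clicks earn nothing


def score_users(df, now):
    """Sum the points per (user_id, category_clicked) pair."""
    age_days = (now - df["time_stamp"]).dt.days
    points = age_days.map(click_points)
    return points.groupby([df["user_id"], df["category_clicked"]]).sum()


# Hypothetical sample rows standing in for the real log file.
logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "time_stamp": pd.to_datetime(
        ["2024-01-29", "2024-01-20", "2024-01-05", "2024-01-30"]
    ),
    "category_clicked": ["books", "books", "music", "music"],
})

now = pd.Timestamp("2024-01-31")  # assumed reference date
scores = score_users(logs, now)
print(scores)
```

This works when everything fits in memory, which is exactly what I cannot do with the full file.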
How can I do this? Am I missing something?