
I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike.

My problem comes when I load the CSV with pandas.read_csv. I would like to use the chunksize parameter to split it, but since I have to perform a groupby operation on user_id to make my calculation (I don't want my score to be too trivial), I don't know how to split the data smartly: if I only use chunksize, a single user's rows can land in different chunks, so I won't be able to use groupby properly.

To keep it simple: I want to do a calculation for each user that depends on the timestamp and the category clicked. For instance, give a user 1 point for a click that happened a month ago, 2 points for a click two weeks ago, and 4 points for a click in the last week.
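
To make this concrete, here is roughly what I would write if the whole file fit in memory (`weblogs.csv` is a stand-in for my real file; the cutoffs are just the toy rule above):

```python
import pandas as pd

now = pd.Timestamp.now()

def score_clicks(group):
    """Toy rule: 4 points per click in the last week, 2 in the last
    two weeks, 1 in the last month, 0 otherwise."""
    age = now - group['time_stamp']
    points = pd.Series(0, index=group.index)
    points[age <= pd.Timedelta(days=30)] = 1   # later masks override earlier ones
    points[age <= pd.Timedelta(days=14)] = 2
    points[age <= pd.Timedelta(days=7)] = 4
    return points.sum()

df = pd.read_csv('weblogs.csv', parse_dates=['time_stamp'])  # too big in reality
scores = df.groupby('user_id').apply(score_clicks)  # Series indexed by user_id
```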

How can I do this? Am I missing something?
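
The only idea I have had so far: if the score is purely additive per click (like the toy rule, reusing score_clicks from above), the per-chunk groupby results can simply be merged, as below, but that only works because the score is trivial:

```python
import pandas as pd

# Accumulate partial per-user sums chunk by chunk.
total = pd.Series(dtype=float)
for chunk in pd.read_csv('weblogs.csv', parse_dates=['time_stamp'],
                         chunksize=1_000_000):
    partial = chunk.groupby('user_id').apply(score_clicks)
    total = total.add(partial, fill_value=0)  # align on user_id, fill gaps with 0
```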

sweeeeeet
  • Possibly related: [Pandas GroupBy Mean of Large DataSet in CSV](http://stackoverflow.com/q/23190156/478288). – chrisaycock Aug 19 '14 at 15:45
  • I am sorry, but it is not really related: the solution given is specific to the mean() function and won't work in my case. – sweeeeeet Aug 19 '14 at 15:49
  • 2
    You basically need to do this: http://stackoverflow.com/questions/15798209/pandas-group-by-query-on-large-data-in-hdfstore. In a nutshell, read in your data using read_csv, save to a hdfstore (table format). Then you can get the keys of the groupby (user_id), and aggregate as needed with a minimum of queries. This is quite scalable. – Jeff Aug 19 '14 at 16:04
  • I would really appreciate it if someone could give more details for the case where there are more than a million groups. I think it could be useful to more people than just me. – sweeeeeet Aug 21 '14 at 07:41
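
Edit: for reference, here is my understanding of the workflow Jeff describes (it needs PyTables installed; the file and table names are placeholders, and score_clicks is the helper from above):

```python
import pandas as pd

# One streaming pass: convert the CSV into an on-disk, queryable HDF5 table.
with pd.HDFStore('weblogs.h5', mode='w') as store:
    for chunk in pd.read_csv('weblogs.csv', parse_dates=['time_stamp'],
                             chunksize=1_000_000):
        # data_columns makes user_id usable in where= queries;
        # min_itemsize reserves room for variable-length strings.
        store.append('clicks', chunk, data_columns=['user_id'],
                     min_itemsize={'category_clicked': 64})

# Then aggregate user by user, pulling only that user's rows into memory.
with pd.HDFStore('weblogs.h5', mode='r') as store:
    user_ids = store.select_column('clicks', 'user_id').unique()
    scores = {}
    for uid in user_ids:
        user_rows = store.select('clicks', where='user_id == uid')
        scores[uid] = score_clicks(user_rows)
```

With more than a million distinct user_ids, one select per user means more than a million disk queries, so I would presumably have to batch ids into each where clause to keep the query count down.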

0 Answers