I am processing a big CSV data file whose fields are user_id, timestamp and category, and I am building a score per category for each user.
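For illustration, the rows look roughly like this (the values below are made up; only the shape matters, in particular that user_id ends with digits):

user_id,timestamp,category
user_000042,2014-03-01 10:15:00,sports
user_000017,2014-03-01 10:15:03,news
user_000042,2014-03-01 10:16:21,music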
I start by reading the CSV file in chunks and applying a groupby on the last two digits of the user_id, so that each chunk is split into at most 100 groups of users, which I append to an HDF5 store (one table per two-digit suffix, 100 tables in total).
Then I loop over the store and process each stored table one after the other: I group it by user_id, calculate the scores for each user, and write an output CSV with one line per user containing all of that user's scores.
This main loop takes 4 hours on my personal computer, and I would like to speed it up since it looks perfectly parallelizable. How can I do that? I thought of multiprocessing (see the sketch after my code below) or Hadoop Streaming; which would be best?
Here is my (simplified) code:
import csv
import pandas as pd

def sub_group_hash(x):
    # bucket users by the last two digits of their user_id (00-99, so up to 100 groups)
    return x['user_id'].str[-2:]

# first pass: split the big CSV into per-bucket tables inside an HDF5 store
reader = pd.read_csv('input.csv', chunksize=500000)
with pd.HDFStore('grouped_input.h5') as store:
    for chunk in reader:
        groups = chunk.groupby(sub_group_hash(chunk))
        for grp, grouped in groups:
            store.append('group_%s' % grp, grouped,
                         data_columns=['user_id', 'timestamp', 'category'])

# second pass: score every user, bucket by bucket, and write one CSV line per user
with open('stats.csv', 'w', newline='') as outfile:
    spamwriter = csv.writer(outfile)
    with pd.HDFStore('grouped_input.h5') as store:
        for grp in store.keys():  # this is the loop I would like to parallelize
            grouped = store.select(grp).groupby('user_id')
            for user, user_group in grouped:
                output = my_function(user, user_group)  # my_function returns the list of scores for one user
                spamwriter.writerow([user] + output)
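With multiprocessing, here is roughly what I had in mind for that loop (just a sketch: I am assuming each worker can safely open the HDF5 store read-only once the first pass has finished writing it, that 4 processes is a reasonable pool size for my machine, and that my_function is defined at module level so the workers can call it):

import csv
import multiprocessing as mp
import pandas as pd

def process_group(key):
    # each worker opens its own read-only handle on the finished store;
    # I am assuming concurrent reads are safe once nothing writes to the file anymore
    with pd.HDFStore('grouped_input.h5', mode='r') as store:
        df = store.select(key)
    rows = []
    for user, user_group in df.groupby('user_id'):
        # my_function is the scoring function from the code above
        rows.append([user] + my_function(user, user_group))
    return rows

if __name__ == '__main__':
    with pd.HDFStore('grouped_input.h5', mode='r') as store:
        keys = store.keys()
    with mp.Pool(processes=4) as pool, \
         open('stats.csv', 'w', newline='') as outfile:
        spamwriter = csv.writer(outfile)
        # only the parent process touches the CSV; rows are written as results arrive
        for rows in pool.imap_unordered(process_group, keys):
            spamwriter.writerows(rows)

Since only the parent process writes to stats.csv, I would not need any locking around the CSV writer; is that the right approach, or is Hadoop Streaming worth the extra setup here?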