I am processing a big CSV data file whose fields are user_id, timestamp and category, and I am building a score per category for each user.
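For illustration, the rows look roughly like this (the values below are made up; only the shape matters, in particular that user_id ends with digits):

user_id,timestamp,category
user_000042,2014-03-01 10:15:00,sports
user_000017,2014-03-01 10:15:03,news
user_000042,2014-03-01 10:16:21,music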
I start by reading the CSV file in chunks and applying a groupby on the last two digits of the user_id, so that each chunk is split into at most 100 groups of users, which I append to an HDF5 store (one table per two-digit suffix, 100 tables in total).
Then I loop over the store and process each stored table one after the other: I group it by user_id, calculate the scores for each user, and write an output CSV with one line per user containing all of that user's scores.
This main loop takes 4 hours on my personal computer, and I would like to speed it up since it looks perfectly parallelizable. How can I do that? I thought of multiprocessing (see the sketch after my code below) or Hadoop Streaming; which would be best?
Here is my (simplified) code:
import csv
import pandas as pd

def sub_group_hash(x):
    # bucket users by the last two digits of their user_id (00-99, so up to 100 groups)
    return x['user_id'].str[-2:]

# first pass: split the big CSV into per-bucket tables inside an HDF5 store
reader = pd.read_csv('input.csv', chunksize=500000)
with pd.HDFStore('grouped_input.h5') as store:
    for chunk in reader:
        groups = chunk.groupby(sub_group_hash(chunk))
        for grp, grouped in groups:
            store.append('group_%s' % grp, grouped,
                         data_columns=['user_id', 'timestamp', 'category'])

# second pass: score every user, bucket by bucket, and write one CSV line per user
with open('stats.csv', 'w', newline='') as outfile:
    spamwriter = csv.writer(outfile)
    with pd.HDFStore('grouped_input.h5') as store:
        for grp in store.keys():  # this is the loop I would like to parallelize
            grouped = store.select(grp).groupby('user_id')
            for user, user_group in grouped:
                output = my_function(user, user_group)  # my_function returns the list of scores for one user
                spamwriter.writerow([user] + output)
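With multiprocessing, here is roughly what I had in mind for that loop (just a sketch: I am assuming each worker can safely open the HDF5 store read-only once the first pass has finished writing it, that 4 processes is a reasonable pool size for my machine, and that my_function is defined at module level so the workers can call it):

import csv
import multiprocessing as mp
import pandas as pd

def process_group(key):
    # each worker opens its own read-only handle on the finished store;
    # I am assuming concurrent reads are safe once nothing writes to the file anymore
    with pd.HDFStore('grouped_input.h5', mode='r') as store:
        df = store.select(key)
    rows = []
    for user, user_group in df.groupby('user_id'):
        # my_function is the scoring function from the code above
        rows.append([user] + my_function(user, user_group))
    return rows

if __name__ == '__main__':
    with pd.HDFStore('grouped_input.h5', mode='r') as store:
        keys = store.keys()
    with mp.Pool(processes=4) as pool, \
         open('stats.csv', 'w', newline='') as outfile:
        spamwriter = csv.writer(outfile)
        # only the parent process touches the CSV; rows are written as results arrive
        for rows in pool.imap_unordered(process_group, keys):
            spamwriter.writerows(rows)

Since only the parent process writes to stats.csv, I would not need any locking around the CSV writer; is that the right approach, or is Hadoop Streaming worth the extra setup here?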