3

My problem is the following: my application needs to upload multiple files to S3 simultaneously using the Boto library in Python. I've worked out 2 solutions, but I'm not sure of the implications of each. Some considerations:

  • This will be running on EC2 micro instances, so low memory, low CPU
  • Usually 1-10 files need to be uploaded at once, but can be more

Solutions, fastest then slowest:

1) Creating threads "manually" with from threading import Thread. This executes in approx. 0.02 seconds.

from boto.s3.connection import S3Connection
from threading import Thread
import time

filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
def upload(myfile):
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    bucket.new_key(myfile).set_contents_from_string('some content')
    return myfile

for fname in filenames:
    Thread(target=upload, args=(fname,)).start()

2) Using a ThreadPool from the multiprocessing module. This takes approx. 0.3 seconds to execute (almost 10x slower).

from boto.s3.connection import S3Connection
from multiprocessing.pool import ThreadPool
import time

filenames = ['1.json', '2.json', '3.json', '4.json', '5.json', '6.json', '7.json', '8.json', '9.json', '10.json']
def upload(myfile):
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    bucket.new_key(myfile).set_contents_from_string('some content')
    return myfile

pool = ThreadPool(processes=16)
pool.map(upload, filenames)

  • What is the difference between these 2 approaches that makes the threadpool 10x slower?
  • Any alternate suggestions for different approaches or recommendations for what I've come up with?

Many thanks.

EDIT: I also just realized that multiprocessing has a Pool (which presumably creates new processes) AND a ThreadPool (which presumably creates thread workers). I'm a bit confused.

L-R

1 Answer

4

Python uses OS threads. While you don't gain anything for CPU-bound tasks, threads are fine for an IO-bound task like yours: the GIL, the Global Interpreter Lock, is released while a thread waits on IO.
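A minimal standard-library sketch of this: ten simulated 0.2-second "uploads" (time.sleep stands in for the network wait) finish in roughly 0.2 seconds total when run on threads, because the waits overlap:

```python
# Sketch: IO waits overlap across threads because the GIL is
# released while a thread blocks on IO. time.sleep stands in
# for a network call here.
import time
from threading import Thread

def fake_upload(results, name):
    time.sleep(0.2)        # simulated network wait; GIL released
    results.append(name)

results = []
start = time.time()
workers = [Thread(target=fake_upload, args=(results, i)) for i in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()               # wait until every upload has finished
elapsed = time.time() - start
print(len(results), round(elapsed, 1))   # 10 uploads in roughly 0.2 s, not 2 s
```

Note the join() calls: without them the main thread races ahead, and any timing taken there only measures how fast the threads were started, not how long the uploads took.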

The multiprocessing module is designed for CPU-bound tasks. A multiprocessing.Pool starts new worker processes, which takes time, and it uses pickle to pass data between them. The ThreadPool you are using creates thread workers instead (which answers your edit), but it still pays the cost of spinning up 16 workers before the first upload starts. Typically, it does not make sense to start more workers/processes than you have CPUs; my rule of thumb is number_of_workers = number_of_cpus - 1. If you need to do the upload many times in a row, you might want to start several workers once, keep them alive, and reuse them over and over again. This can justify the overhead of starting new workers, as long as you do noticeable computation for each upload. You need to profile this for your case.
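For illustration, a sketch of reusing one pool across batches, so the worker startup cost is paid only once (do_upload is a placeholder, not your boto code):

```python
# Sketch: create the pool once and reuse its workers across
# batches, so the startup cost is paid a single time.
from multiprocessing.pool import ThreadPool

def do_upload(name):
    # placeholder for conn.get_bucket(...).new_key(name)...
    return name

pool = ThreadPool(processes=4)                       # started once
batch1 = pool.map(do_upload, ['1.json', '2.json'])   # reuses the same workers
batch2 = pool.map(do_upload, ['3.json', '4.json'])   # no new startup cost
pool.close()
pool.join()
print(batch1 + batch2)
```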

A third option would be to go asynchronous, for example with Twisted. That requires restructuring your code, since you have to work with callbacks.
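If Twisted feels like a big jump, the callback style can also be sketched with the standard library's concurrent.futures (Python 3.2+, or the futures backport on PyPI for Python 2); do_upload below is a placeholder for the real boto call:

```python
# Sketch of callback-style uploads with concurrent.futures.
# do_upload is a placeholder for the real S3 upload.
from concurrent.futures import ThreadPoolExecutor

done = []

def do_upload(name):
    return name  # placeholder for set_contents_from_string(...)

def on_done(future):
    done.append(future.result())  # runs when the upload finishes

with ThreadPoolExecutor(max_workers=4) as executor:
    for fname in ['1.json', '2.json', '3.json']:
        executor.submit(do_upload, fname).add_done_callback(on_done)
# leaving the with-block waits for all submitted uploads
print(sorted(done))
```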

Mike Müller
  • Thanks for the details. Since I'll be running this on EC2's smallest machines, it might not make sense to use the `multiprocessing` module. Since option #1 was the fastest, what are the pitfalls of creating a bunch of OS threads - other than being expensive? How many threads are recommended for something like this? (it doesn't seem right to create 20 threads if I have 20 uploads...) – L-R Jun 07 '13 at 19:55
  • This depends on a lot of factors. The simplest is just to try it out. Run without threads, then with 2, 4, 8, 16... Use different sizes for your JSON files. And of course, try the whole suite of options at different times, for which you expect different network traffic, to get a good estimate of how things work out in real life. Performance is all about measuring. I've been wrong so many times with my gut feelings. You need to generate hard numbers for realistic cases. – Mike Müller Jun 07 '13 at 20:22
  • Great. Your answer echoes [this](http://stackoverflow.com/questions/481970/how-many-threads-is-too-many), actually. One last thing, the [docs for Thread Objects](http://docs.python.org/2/library/threading.html#thread-objects) mention that "[a thread] stops being alive when its run() method terminates". If I understand correctly, I don't need to explicitly clean up after a thread has done uploading its file, correct? – L-R Jun 07 '13 at 20:33
  • Yes, as soon as you return you are done. You can always check with `threading.enumerate()` how many threads are currently alive. – Mike Müller Jun 07 '13 at 21:15