My problem is the following: my application needs to upload multiple files to S3 simultaneously, using the Boto library in Python. I've worked out two solutions, but I'm not sure of the implications of each. Some considerations:
- This will be running on EC2 micro instances, so low memory, low CPU
- Usually 1-10 files need to be uploaded at once, but can be more
Solutions, from fastest to slowest:
1) Creating threads "manually" with threading.Thread. This executes in approx. 0.02 seconds.
from boto.s3.connection import S3Connection
from threading import Thread
import time

filenames = ['1.json', '2.json', '3.json', '4.json', '5.json',
             '6.json', '7.json', '8.json', '9.json', '10.json']

def upload(myfile):
    # Each worker opens its own connection and writes one key
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    bucket.new_key(myfile).set_contents_from_string('some content')
    return myfile

# Launch one thread per file (start() returns None, so no handle is kept)
for fname in filenames:
    Thread(target=upload, args=(fname,)).start()
2) Using a ThreadPool from the multiprocessing module. This takes approx. 0.3 seconds to execute (almost 10x slower).
from boto.s3.connection import S3Connection
from multiprocessing.pool import ThreadPool
import time

filenames = ['1.json', '2.json', '3.json', '4.json', '5.json',
             '6.json', '7.json', '8.json', '9.json', '10.json']

def upload(myfile):
    # Same worker as above: one connection and one key per file
    conn = S3Connection(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)
    bucket = conn.get_bucket("parallel_upload_tests")
    bucket.new_key(myfile).set_contents_from_string('some content')
    return myfile

# Run upload() over all filenames with 16 worker threads
pool = ThreadPool(processes=16)
pool.map(upload, filenames)
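For context, the timings above are rough wall-clock numbers. A minimal sketch of that kind of measurement (not my exact harness), taken around the map() call and, for the first version, around the thread-launching loop:

import time

start = time.time()
pool = ThreadPool(processes=16)
pool.map(upload, filenames)
print("ThreadPool version took %.3f seconds" % (time.time() - start))
# The Thread version is timed the same way, around the for-loop calling start()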
- What is the difference between these two approaches that makes the ThreadPool 10x slower?
- Any suggestions for different approaches, or recommendations on what I've come up with?
Many thanks.
EDIT: I also just realized that multiprocessing has a Pool (which presumably creates new processes) AND a ThreadPool (which presumably creates thread workers). I'm a bit confused.
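For reference, a minimal sketch of what I mean, assuming the same upload() and filenames as above (pool size chosen arbitrarily):

from multiprocessing import Pool              # workers are separate processes
from multiprocessing.pool import ThreadPool   # workers are threads in this process

# Both expose the same map() interface; with Pool, upload() and its
# arguments have to be picklable so they can be sent to the worker processes.
process_pool = Pool(processes=4)
thread_pool = ThreadPool(processes=4)

process_pool.map(upload, filenames)
thread_pool.map(upload, filenames)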