18

Recently I needed to implement a program in Python that uploads files residing on Amazon EC2 to S3 as quickly as possible. The files are about 30 KB each.

I have tried several solutions: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 files * 30 KB each ≈ 105 MB total:

       **5.5 s [ 4 processes + 100 coroutines ]**
       10 s  [ 200 coroutines ]
       14 s  [ 10 threads ]

The code is shown below.

For multithreading

import os
import threading

# connect_to_s3_sevice(), put(), DATA_DIR and NTHREAD are defined elsewhere in my code.

def mput(i, client, files):
    # Each thread uploads the subset of files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()

For coroutines

import functools
import os
import sys

import eventlet
eventlet.monkey_patch()  # patch blocking socket I/O so green threads can overlap uploads

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size is passed on the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()

For multiprocessing + coroutines

import functools
import multiprocessing
import os

import eventlet
eventlet.monkey_patch()  # patch blocking socket I/O so green threads can overlap uploads


def pproc(i):
    # Each worker process opens its own S3 connection and uploads its share of the
    # files with a pool of 100 green threads.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()

The machine configuration is Ubuntu 14.04 with 2 CPUs (2.50 GHz) and 4 GB of memory.

The highest speed reached is about 19 MB/s (105 / 5.5). Overall, it is still too slow. Is there any way to speed it up? Could Stackless Python do it faster?

Jacky1205

3 Answers

8

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
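For example, a minimal sketch of shelling out to the CLI from Python might look like this (the local directory and bucket URI are placeholders; it assumes the AWS CLI is installed and credentials are configured):

import subprocess

# Placeholder paths/bucket for illustration only.
local_dir = "/path/to/local/files"
bucket_uri = "s3://my-bucket/my-prefix/"

# "aws s3 cp --recursive" (or "aws s3 sync") walks the directory and uploads the
# files in parallel, governed by the CLI's s3.max_concurrent_requests setting.
subprocess.check_call(["aws", "s3", "cp", local_dir, bucket_uri, "--recursive"])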

John Rotenstein
  • The link only shows multithreading and multiprocessing; it does not cover coroutines or the combination of multiprocessing and coroutines. In my tests, the latter gives much better performance. – Jacky1205 Dec 15 '14 at 11:04
  • I have tested the CLI with the simple shell command 'aws s3 cp myfolder s3://mybucket/myfolder'. It also has poor performance. And again, I want to say that the result in the article 'Parallel S3 Uploads Using Boto and Threads in Python' was not accurate. How could the author gain a **70x** speedup with only **10** threads? It is awesome! – Jacky1205 Dec 16 '14 at 07:26
  • I just tested the approach in 'Parallel S3 Uploads Using Boto and Threads in Python' and can confirm the 70x speedup isn't accurate. Python reports that my code has finished almost instantly, but I can see from monitoring what's actually on S3 that the uploads are still proceeding in the background. Not sure how to get a really accurate time for this method, but it looks comparable to the others. – Sohier Dane Sep 08 '16 at 17:17
  • @SohierDane you need to join the processes/threads at the end of your Python code if you want the script to wait until uploading is finished; that should give you accurate times. Otherwise the threads detach from the parent process and complete on their own, so your main Python script exits instantly. – alfredox Sep 19 '16 at 20:05
8

I recently needed to upload about 5 TB of small files to AWS and reached full network bandwidth of ~750 Mbit/s (a 1 Gbit connection per server) without problems by setting a higher "max_concurrent_requests" value in the ~/.aws/config file.

I further sped up the process by starting multiple upload jobs via a bash for-loop and sending these jobs to different servers.

I also tried Python tools, e.g. s3-parallel-put, but I think this approach is way faster. Of course, if the files are very small, one should consider compressing them first, uploading the archive to EBS/S3, and decompressing it there (see the sketch at the end of this answer).

Here is some code that might help.

$ cat ~/.aws/config
[default]
region = eu-west-1
output = text
s3 =
    max_concurrent_requests = 100

Then start multiple aws copy jobs, e.g.:

for folder in `ls`; do aws s3 cp $folder s3://<bucket>/$folder/whatever/ --recursive; done
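
And here is a minimal sketch of the compress-then-upload idea mentioned above (the source directory, archive path, and bucket are placeholders; it assumes the AWS CLI is installed and credentials are configured):

import subprocess
import tarfile

# Placeholder paths and bucket, for illustration only.
src_dir = "/path/to/small_files"
archive = "/tmp/small_files.tar.gz"

# Bundle the many tiny files into one archive; a single large object uploads far
# more efficiently than thousands of 30 KB objects.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src_dir, arcname="small_files")

# Upload the archive, then decompress it on the receiving side (EC2/EBS) as needed.
subprocess.check_call(["aws", "s3", "cp", archive, "s3://<bucket>/archives/small_files.tar.gz"])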
PlagTag
1

I had the same problem as you. My solution was to send the data to AWS SQS and then save it to S3 using AWS Lambda.

So the data flow looks like: app -> SQS -> Lambda -> S3

The entire process is asynchronous, but near real-time :)
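
A minimal sketch of the Lambda half of that flow might look like this (the bucket name and key scheme are made up for illustration, and it assumes the function is wired up so that SQS messages arrive in the event's "Records" list):

import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder bucket name


def handler(event, context):
    # Each invocation receives a batch of SQS messages in event["Records"].
    for record in event["Records"]:
        # Write each message body as its own S3 object; the key scheme here is arbitrary.
        s3.put_object(Bucket=BUCKET,
                      Key="incoming/{}.json".format(uuid.uuid4()),
                      Body=record["body"])

The app side only has to enqueue the payloads (e.g. via SQS send_message), so it never waits on S3 itself.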

Hkar
  • Good solution, but isn't it a bit of overhead? I mean, a lot of non-free infrastructure just to perform an asynchronous upload. – Imnl Oct 10 '16 at 15:22
  • Yes, there is definitely overhead. But it is completely asynchronous and scalable (and that was what I needed). – Hkar Oct 12 '16 at 13:53
  • @Hkar, but will it work in the case where we have a huge number of small files (100,000) that need to be uploaded to S3? The max size of each XML file is 20 KB. – Atharv Thakur Sep 04 '18 at 11:34