18

Recently I needed to implement a program in Python that uploads files residing on Amazon EC2 to S3 as quickly as possible. The files are about 30 KB each.

I have tried several solutions: multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 files * 30 KB each ≈ 105 MB total:

       **5.5 s [ 4 processes + 100 coroutines ]**
       10 s  [ 200 coroutines ]
       14 s  [ 10 threads ]

The code is shown below.

For multithreading

import os
import threading

# connect_to_s3_sevice(), put(), DATA_DIR and NTHREAD are defined elsewhere in my code.

def mput(i, client, files):
    # Each thread uploads the subset of files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()

For coroutines

import functools
import os
import sys

import eventlet
eventlet.monkey_patch()  # patch blocking socket I/O so green threads can overlap uploads

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size is passed on the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()

For multiprocessing + coroutines

import functools
import multiprocessing
import os

import eventlet
eventlet.monkey_patch()  # patch blocking socket I/O so green threads can overlap uploads


def pproc(i):
    # Each worker process opens its own S3 connection and uploads its share of the
    # files with a pool of 100 green threads.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()

The machine configuration is Ubuntu 14.04 with 2 CPUs (2.50 GHz) and 4 GB of memory.

The highest speed reached is about 19 MB/s (105 / 5.5). Overall, it is still too slow. Is there any way to speed it up? Could Stackless Python do it faster?

Jacky1205

3 Answers

8

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
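For example, a minimal sketch of shelling out to the CLI from Python might look like this (the local directory and bucket URI are placeholders; it assumes the AWS CLI is installed and credentials are configured):

import subprocess

# Placeholder paths/bucket for illustration only.
local_dir = "/path/to/local/files"
bucket_uri = "s3://my-bucket/my-prefix/"

# "aws s3 cp --recursive" (or "aws s3 sync") walks the directory and uploads the
# files in parallel, governed by the CLI's s3.max_concurrent_requests setting.
subprocess.check_call(["aws", "s3", "cp", local_dir, bucket_uri, "--recursive"])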

John Rotenstein
  • The link only shows multithreading and multiprocessing; it does not cover coroutines or the combination of multiprocessing and coroutines. In my tests, the latter gives much better performance. – Jacky1205 Dec 15 '14 at 11:04
  • I have tested the CLI with the simple shell command 'aws s3 cp myfolder s3://mybucket/myfolder'. It also has poor performance. And again, I want to say that the result in the article 'Parallel S3 Uploads Using Boto and Threads in Python' was not accurate. How could the author gain a **70x** speedup with only **10** threads? It is awesome! – Jacky1205 Dec 16 '14 at 07:26
  • I just tested the approach in 'Parallel S3 Uploads Using Boto and Threads in Python' and can confirm the 70x speedup isn't accurate. Python reports that my code has finished almost instantly, but I can see from monitoring what's actually on S3 that the uploads are still proceeding in the background. Not sure how to get a really accurate time for this method, but it looks comparable to the others. – Sohier Dane Sep 08 '16 at 17:17
  • @SohierDane you need to join the processes/threads at the end of your Python code if you want the script to wait until uploading is finished; that should give you accurate times. Otherwise the threads detach from the parent process and complete on their own, so your main Python script exits instantly. – alfredox Sep 19 '16 at 20:05
8

I recently needed to upload about 5 TB of small files to AWS and reached full network bandwidth of ~750 Mbit/s (a 1 Gbit connection per server) without problems by setting a higher "max_concurrent_requests" value in the ~/.aws/config file.

I further sped up the process by starting multiple upload jobs via a bash for-loop and sending these jobs to different servers.

I also tried Python tools, e.g. s3-parallel-put, but I think this approach is way faster. Of course, if the files are very small, one should consider compressing them first, uploading the archive to EBS/S3, and decompressing it there (see the sketch at the end of this answer).

Here is some code that might help.

$ cat ~/.aws/config
[default]
region = eu-west-1
output = text
s3 =
    max_concurrent_requests = 100

Then start multiple aws copy jobs, e.g.:

for folder in `ls`; do aws s3 cp $folder s3://<bucket>/$folder/whatever/ --recursive; done
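
And here is a minimal sketch of the compress-then-upload idea mentioned above (the source directory, archive path, and bucket are placeholders; it assumes the AWS CLI is installed and credentials are configured):

import subprocess
import tarfile

# Placeholder paths and bucket, for illustration only.
src_dir = "/path/to/small_files"
archive = "/tmp/small_files.tar.gz"

# Bundle the many tiny files into one archive; a single large object uploads far
# more efficiently than thousands of 30 KB objects.
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src_dir, arcname="small_files")

# Upload the archive, then decompress it on the receiving side (EC2/EBS) as needed.
subprocess.check_call(["aws", "s3", "cp", archive, "s3://<bucket>/archives/small_files.tar.gz"])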
PlagTag
1

I had the same problem as you. My solution was to send the data to AWS SQS and then save it to S3 using AWS Lambda.

So the data flow looks like: app -> SQS -> Lambda -> S3

The entire process is asynchronous, but near real-time :)
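
A minimal sketch of the Lambda half of that flow might look like this (the bucket name and key scheme are made up for illustration, and it assumes the function is wired up so that SQS messages arrive in the event's "Records" list):

import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder bucket name


def handler(event, context):
    # Each invocation receives a batch of SQS messages in event["Records"].
    for record in event["Records"]:
        # Write each message body as its own S3 object; the key scheme here is arbitrary.
        s3.put_object(Bucket=BUCKET,
                      Key="incoming/{}.json".format(uuid.uuid4()),
                      Body=record["body"])

The app side only has to enqueue the payloads (e.g. via SQS send_message), so it never waits on S3 itself.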

Hkar
  • Good solution, but isn't it a bit of overhead? I mean, a lot of non-free infrastructure just to perform an asynchronous upload. – Imnl Oct 10 '16 at 15:22
  • Yes, there is definitely overhead. But it is completely asynchronous and scalable (and that was what I needed). – Hkar Oct 12 '16 at 13:53
  • @Hkar, but will it work in the case where we have a huge number of small files (100,000) that need to be uploaded to S3? The max size of each XML file is 20 KB. – Atharv Thakur Sep 04 '18 at 11:34