
I've tried Python + boto + multiprocessing, s3cmd, and J3tset, but I'm struggling with all of them.

Any suggestions, perhaps a ready-made script you've been using or another way I don't know of?

EDIT:

eventlet + boto is a worthwhile solution, as mentioned below. I found a good eventlet reference article here: http://web.archive.org/web/20110520140439/http://teddziuba.com/2010/02/eventlet-asynchronous-io-for-g.html

I've added the Python script that I'm using right now below.


2 Answers


Okay, I figured out a solution based on @Matt Billenstein's hint. It uses the eventlet library. The first step is the most important here (monkey patching the standard I/O libraries).

Run this script in the background with nohup and you're all set.

import eventlet
from eventlet import GreenPool

# Monkey patch the standard I/O libraries first so boto's socket calls
# become cooperative and downloads can overlap.
eventlet.monkey_patch(all=True)

import logging
import os

from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # Important: download the key over a new connection in each green thread
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    # Keys containing "/" map to local subdirectories, which must exist
    # before get_contents_to_filename() can write the file.
    dir_name = os.path.dirname(key_name)
    if dir_name and not os.path.exists(dir_name):
        os.makedirs(dir_name)

    try:
        key.get_contents_to_filename(key_name)
    except Exception:
        logging.exception("%s: FAILED", key_name)


if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket_list:
        pool.spawn_n(download_file, key.key)
    pool.waitall()
  • Note, I've had issues if I don't create a connection in each greenlet. Were you able to download all your objects using this? – Matt Billenstein Jan 18 '11 at 20:13
  • No. I had issues too. It stopped working after downloading 4000 objects. I didn't have time to debug it, so I ended up using *s3cmd get* from a shell script for each file. I divided the list of filenames on S3 into several sets and ran the script on 7-8 sets at a time (so I had 7-8 *s3cmd get* requests at any point in time). Use boto's *bucket.list()* method to get the file list and then use the *split* shell command to create equally sized sets (a rough sketch of this approach follows these comments). This might consume more CPU than the eventlet approach, but it's simple and gets the job done. – Jagtesh Chadha Jan 22 '11 at 03:25
  • I've edited the code to create a new connection for each file download (this plays nice with green threads). – Jagtesh Chadha Jan 03 '13 at 08:47
  • Added ```pool.waitall()``` to the end; otherwise the code did nothing and exited before any download actually completed. – Jan Vlcinsky Mar 20 '14 at 15:41
  • Bug fixed - bucket_list was not used. Since I was asked to alter more than 6 chars, added notice about missing subdirs, too. – Juraj Nov 15 '14 at 17:48
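
A rough sketch of the s3cmd fallback described in the comments above, assuming boto 2 and the s3cmd CLI are installed; "KEY", "SECRET", "BUCKET", and "PREFIX" are placeholders, and a process pool stands in for the hand-split shell scripts:

import subprocess
from multiprocessing import Pool

from boto.s3.connection import S3Connection


def fetch_one(key_name):
    # Shell out to s3cmd for a single object
    return subprocess.call(["s3cmd", "get", "s3://BUCKET/" + key_name])


if __name__ == "__main__":
    # List all key names with boto, then download them 8 at a time,
    # roughly matching the 7-8 concurrent s3cmd gets described above.
    conn = S3Connection("KEY", "SECRET")
    key_names = [k.key for k in conn.get_bucket("BUCKET").list(prefix="PREFIX")]

    pool = Pool(8)
    pool.map(fetch_one, key_names)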

Use eventlet to give you I/O parallelism, write a simple function to download one object using urllib, then use a GreenPile to map that to a list of input urls -- a pile with 50 to 100 greenlets should do...
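A minimal sketch of that suggestion, assuming Python 2-era urllib to match this answer; the URL list, the local filename scheme, and the pile size are illustrative placeholders:

import eventlet
eventlet.monkey_patch()  # make socket I/O cooperative so downloads overlap

import urllib  # Python 2 style urllib, matching the era of this answer


def fetch(url):
    # Download one object to a file named after the last path segment
    local_name = url.rsplit("/", 1)[-1]
    urllib.urlretrieve(url, local_name)
    return url


urls = ["https://BUCKET.s3.amazonaws.com/file1",
        "https://BUCKET.s3.amazonaws.com/file2"]

pile = eventlet.GreenPile(100)  # 50 to 100 greenlets, as suggested
for url in urls:
    pile.spawn(fetch, url)

# Iterating the pile waits for each spawned download to finish
for finished in pile:
    print("downloaded %s" % finished)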
