
I've tried Python + boto + multiprocessing, s3cmd, and J3tset, but I'm struggling with all of them.

Any suggestions, perhaps a ready-made script you've been using or another way I don't know of?

EDIT:

eventlet + boto is a worthwhile solution, as mentioned below. I found a good eventlet reference article here: http://web.archive.org/web/20110520140439/http://teddziuba.com/2010/02/eventlet-asynchronous-io-for-g.html

I've added the Python script that I'm using right now below.


2 Answers


Okay, I figured out a solution based on @Matt Billenstein's hint. It uses the eventlet library. The first step is the most important here (monkey patching the standard I/O libraries).

Run this script in the background with nohup and you're all set.

import eventlet
from eventlet import GreenPool

# Monkey patch the standard I/O libraries first so boto's socket calls
# become cooperative and downloads can overlap.
eventlet.monkey_patch(all=True)

import logging
import os

from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # Important: download the key over a new connection in each green thread
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    # Keys containing "/" map to local subdirectories, which must exist
    # before get_contents_to_filename() can write the file.
    dir_name = os.path.dirname(key_name)
    if dir_name and not os.path.exists(dir_name):
        os.makedirs(dir_name)

    try:
        key.get_contents_to_filename(key_name)
    except Exception:
        logging.exception("%s: FAILED", key_name)


if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket_list:
        pool.spawn_n(download_file, key.key)
    pool.waitall()
  • Note, I've had issues if I don't create a connection in each greenlet. Were you able to download all your objects using this? – Matt Billenstein Jan 18 '11 at 20:13
  • No. I had issues too. It stopped working after downloading 4000 objects. I didn't have time to debug it, so I ended up using *s3cmd get* from a shell script for each file. I divided the list of filenames on S3 into several sets and ran the script on 7-8 sets at a time (so I had 7-8 *s3cmd get* requests at any point in time). Use boto's *bucket.list()* method to get the file list and then use the *split* shell command to create equally sized sets (a rough sketch of this approach follows these comments). This might consume more CPU than the eventlet approach, but it's simple and gets the job done. – Jagtesh Chadha Jan 22 '11 at 03:25
  • I've edited the code to create a new connection for each file download (this plays nice with green threads). – Jagtesh Chadha Jan 03 '13 at 08:47
  • Added ```pool.waitall()``` to the end; otherwise the code did nothing and exited before any download actually completed. – Jan Vlcinsky Mar 20 '14 at 15:41
  • Bug fixed - bucket_list was not used. Since I was asked to alter more than 6 chars, added notice about missing subdirs, too. – Juraj Nov 15 '14 at 17:48
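
A rough sketch of the s3cmd fallback described in the comments above, assuming boto 2 and the s3cmd CLI are installed; "KEY", "SECRET", "BUCKET", and "PREFIX" are placeholders, and a process pool stands in for the hand-split shell scripts:

import subprocess
from multiprocessing import Pool

from boto.s3.connection import S3Connection


def fetch_one(key_name):
    # Shell out to s3cmd for a single object
    return subprocess.call(["s3cmd", "get", "s3://BUCKET/" + key_name])


if __name__ == "__main__":
    # List all key names with boto, then download them 8 at a time,
    # roughly matching the 7-8 concurrent s3cmd gets described above.
    conn = S3Connection("KEY", "SECRET")
    key_names = [k.key for k in conn.get_bucket("BUCKET").list(prefix="PREFIX")]

    pool = Pool(8)
    pool.map(fetch_one, key_names)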

Use eventlet to give you I/O parallelism, write a simple function to download one object using urllib, then use a GreenPile to map that to a list of input urls -- a pile with 50 to 100 greenlets should do...
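A minimal sketch of that suggestion, assuming Python 2-era urllib to match this answer; the URL list, the local filename scheme, and the pile size are illustrative placeholders:

import eventlet
eventlet.monkey_patch()  # make socket I/O cooperative so downloads overlap

import urllib  # Python 2 style urllib, matching the era of this answer


def fetch(url):
    # Download one object to a file named after the last path segment
    local_name = url.rsplit("/", 1)[-1]
    urllib.urlretrieve(url, local_name)
    return url


urls = ["https://BUCKET.s3.amazonaws.com/file1",
        "https://BUCKET.s3.amazonaws.com/file2"]

pile = eventlet.GreenPile(100)  # 50 to 100 greenlets, as suggested
for url in urls:
    pile.spawn(fetch, url)

# Iterating the pile waits for each spawned download to finish
for finished in pile:
    print("downloaded %s" % finished)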
