6

I have a bucket in S3 called "sample-data". Inside the bucket I have folders labelled "A" to "Z".

Inside each alphabetical folder there are more files and folders. What is the fastest way to download an alphabetical folder and all of its contents?

For example: sample-data/a/foo.txt, sample-data/a/more_files/foo1.txt

In the above example, the bucket sample-data contains a folder called a, which contains foo.txt and a folder called more_files, which in turn contains foo1.txt.

I know how to download a single file. For instance, if I wanted foo.txt I would do the following.

    import boto3

    s3 = boto3.client('s3')
    s3.download_file("sample-data", "a/foo.txt", "foo.txt")

However, I am wondering whether I can download the folder called a and all of its contents in one go. Any help would be appreciated.

Dinero
  • You can only download one file at a time. However, you could use multithreading to request multiple files in parallel, as sketched after these comments. (The AWS CLI does this.) – John Rotenstein Oct 07 '20 at 01:19
  • You can check an answer I have given on a similar question: https://stackoverflow.com/questions/49772151/download-a-folder-from-s3-using-boto3/54672690?noredirect=1#comment113580555_54672690 – Konstantinos Katsantonis Oct 07 '20 at 11:39
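
As a rough illustration of the multithreading idea from the comment above, here is a minimal sketch that lists everything under a prefix and downloads the objects in parallel with `concurrent.futures`. It reuses the bucket and folder names from the question; the helper names, destination directory, and worker count are arbitrary choices.

    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    # boto3 clients are safe to share across threads
    s3 = boto3.client('s3')


    def download_key(bucket, key, target):
        # Recreate the key's directory structure locally, then download the object
        local_path = os.path.join(target, key)
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        s3.download_file(bucket, key, local_path)


    def download_prefix_parallel(bucket, prefix, target, workers=8):
        paginator = s3.get_paginator('list_objects_v2')
        keys = [
            obj['Key']
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get('Contents', [])
            if not obj['Key'].endswith('/')  # skip "folder" placeholder objects
        ]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(download_key, bucket, key, target) for key in keys]
            for future in futures:
                future.result()  # surface any download errors


    # e.g. download everything under sample-data/a/ into ./downloads
    download_prefix_parallel('sample-data', 'a/', 'downloads')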

2 Answers

18

I think your best bet would be the awscli.

    aws s3 cp --recursive s3://mybucket/your_folder_named_a path/to/your/destination

From the docs:

--recursive (boolean) Command is performed on all files or objects under the specified directory or prefix.
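
For the bucket and folder in the question, that would look something like this (downloading into a local directory named a):

    aws s3 cp --recursive s3://sample-data/a ./a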

EDIT:

However, if you want to do this with boto3, try this:

    import os
    import errno
    import boto3

    client = boto3.client('s3')


    def assert_dir_exists(path):
        # Create the directory if it does not exist yet
        try:
            os.makedirs(path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise


    def download_dir(bucket, path, target):
        # Handle missing / at end of prefix
        if not path.endswith('/'):
            path += '/'

        paginator = client.get_paginator('list_objects_v2')
        for result in paginator.paginate(Bucket=bucket, Prefix=path):
            # Download each file individually (.get() guards against an empty prefix)
            for key in result.get('Contents', []):
                # Calculate the path relative to the prefix
                rel_path = key['Key'][len(path):]
                # Skip keys ending in /, which are just "folder" placeholder objects
                if not key['Key'].endswith('/'):
                    local_file_path = os.path.join(target, rel_path)
                    # Make sure the local directories exist
                    local_file_dir = os.path.dirname(local_file_path)
                    assert_dir_exists(local_file_dir)
                    client.download_file(bucket, key['Key'], local_file_path)


    download_dir('your_bucket', 'your_folder', 'destination')
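
For the example in the question, that call would be roughly `download_dir('sample-data', 'a', 'a')`, which recreates everything under a/ in a local directory named a.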
baduker
  • is it not possible to do similar using boto3? – Dinero Oct 06 '20 at 17:06
  • @Dinero it's possible but it's error prone and it's not as easy as with `awscli` – baduker Oct 06 '20 at 17:34
  • The OP seems to be writing a Python application using the Amazon boto3 SDK. Recommending the use of a command line tool here is unlikely to be The Right Answer. Sure, Python could `subprocess.run` the AWS CLI, but that's almost certainly wrong and instead they should list and download through the SDK. – Adam Smith Oct 06 '20 at 17:39
  • It seems likely that OP will want to handle the subdirectories too (which should be straightforward with your existing code: `if key['Key'].endswith('/'): download_dir(bucket, key['Key'], f"{target}/{key['Key']}")`). – Adam Smith Oct 07 '20 at 17:36

0

You list all the objects under the prefix ("folder") you want to download, then iterate over them and download each file.

    import boto3

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(
        Bucket=BUCKET,
        Prefix='DIR1/DIR2',
    )

The response is a dict; the key that contains the list of objects is "Contents".
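
To complete the picture, here is a minimal sketch of that iteration, reusing the bucket and folder names from the question (the local destination directory is an arbitrary choice). Note that a single list_objects_v2 call returns at most 1,000 keys; for larger folders, use the paginator shown in the answer above.

    import os

    import boto3

    s3 = boto3.client("s3")

    response = s3.list_objects_v2(Bucket="sample-data", Prefix="a/")

    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip "folder" placeholder objects
            continue
        # Recreate the key's directory structure locally before downloading
        local_path = os.path.join("downloads", key)
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        s3.download_file("sample-data", key, local_path)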

Here is more information:

list all files in a bucket

boto3 documentation

I am not sure whether this is the fastest solution, but it may help you.

SWater