60

Using the Boto3 Python SDK, I was able to download files using the method bucket.download_file().

Is there a way to download an entire folder?

John Rotenstein
El Fadel Anas
  • Possibly duplicate- https://stackoverflow.com/questions/31918960/boto3-to-download-all-files-from-a-s3-bucket/31960438 – Yoav Gaudin Apr 11 '18 at 10:10
  • Possible duplicate of [Boto3 to download all files from a S3 Bucket](https://stackoverflow.com/questions/31918960/boto3-to-download-all-files-from-a-s3-bucket) – Vincent de Lagabbe Jan 23 '19 at 14:26

10 Answers

89

Quick and dirty, but it works:

import boto3
import os

def downloadDirectoryFroms3(bucketName, remoteDirectoryName):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        if obj.key.endswith('/'):  # skip "folder" placeholder objects
            continue
        directory = os.path.dirname(obj.key)
        if directory and not os.path.exists(directory):
            os.makedirs(directory)
        bucket.download_file(obj.key, obj.key)  # save to same relative path

Assuming you want to download the directory foo/bar from S3, the for-loop iterates over all objects whose key starts with the prefix foo/bar.
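For example, a call like this (the bucket name is a placeholder) mirrors the foo/bar case above and saves the files under the same relative path locally:

downloadDirectoryFroms3('my-bucket', 'foo/bar')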

50

A slightly less dirty modification of the accepted answer by Konstantinos Katsantonis:

import os
import boto3

s3 = boto3.resource('s3') # assumes credentials & configuration are handled outside python in .aws directory or environment variables

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)

This downloads nested subdirectories, too. I was able to download a directory with over 3000 files in it. You'll find other solutions at Boto3 to download all files from a S3 Bucket, but I don't know if they're any better.
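For instance, a minimal call (bucket and folder names are placeholders) that downloads into a chosen local directory:

download_s3_folder('my-bucket', 'path/to/folder', local_dir='local_folder')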

HolyGuacamole
bjc
19

You could also use cloudpathlib which, for S3, wraps boto3. For your use case, it's pretty simple:

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/folder/folder2/")
cp.download_to("local_folder")

hume
  • does somebody know, if AWS counts this as one request for the billing?! – Alex Aug 11 '21 at 20:47
  • Probably not. It should work out to about same as looping over each key with `boto3` (maybe with an added call to list objects, but you need that in both cases) – hume Aug 11 '21 at 23:55
  • for me it only worked without the trailing `/`... in the example above it would be: `cp = CloudPath("s3://bucket/folder/folder2")` – Luiz Tauffer Dec 16 '21 at 20:01
  • @hume Can I pass the relative path to the CloudPath. For example: "s3://bucket/*/*/device/" ? – trungducng Jan 07 '22 at 07:08
  • @trungducng There is a `glob` method like with a normal `Path` that you can use to loop over those files and call `download_to` on each one individually. https://cloudpathlib.drivendata.org/stable/api-reference/s3path/#cloudpathlib.s3.s3path.S3Path.glob – hume Jan 20 '22 at 19:34
  • Why isn't this the top solution!! This tool is awesome. – user2755526 Feb 17 '22 at 17:57
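A minimal sketch of the glob-based approach hume describes in the comment above; the bucket name and the wildcard pattern are placeholders:

from cloudpathlib import CloudPath

root = CloudPath("s3://bucket/")
# iterate over keys matching the pattern and download each file individually
for p in root.glob("*/*/device/*"):
    p.download_to("local_folder")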
5

Using boto3 you can set AWS credentials and download files from S3:

import boto3
import os

# set AWS credentials (placeholders; prefer a credentials file or environment variables)
s3r = boto3.resource('s3', aws_access_key_id='xxxxxxxxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
bucket = s3r.Bucket('bucket_name')

# downloading folder
prefix = 'dirname'
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/'):  # skip "folder" placeholder objects
        continue
    directory = os.path.dirname(obj.key)
    if directory:
        os.makedirs(directory, exist_ok=True)
    bucket.download_file(obj.key, obj.key)

If you cannot find your access_key and secret_access_key, refer to this page.
I hope it helps. Thank you.

Soulduck
  • Better to avoid putting your keys in your code file. At worst, you can put your keys in a separate protected file and import them. It's also possible to use boto3 without any credentials cached and instead use either s3fs or just rely on the config file (https://www.reddit.com/r/aws/comments/73212m/has_anyone_found_a_way_to_hide_boto3_credentials/) – Zach Rieck Jul 28 '20 at 19:02
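A minimal sketch of the config-file approach mentioned in the comment above, assuming the keys live in a ~/.aws/credentials profile (the profile name is a placeholder):

import boto3

# uses the [my-profile] section of ~/.aws/credentials instead of hard-coded keys
session = boto3.Session(profile_name='my-profile')
bucket = session.resource('s3').Bucket('bucket_name')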
5

Another approach, building on the answer from @bjc, that leverages the built-in Path library and parses the S3 URI for you:

import boto3
from pathlib import Path
from urllib.parse import urlparse

def download_s3_folder(s3_uri, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        s3_uri: the s3 uri to the top level of the files you wish to download
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(urlparse(s3_uri).hostname)
    s3_path = urlparse(s3_uri).path.lstrip('/')
    if local_dir is not None:
        local_dir = Path(local_dir)
    for obj in bucket.objects.filter(Prefix=s3_path):
        target = Path(obj.key) if local_dir is None else local_dir / Path(obj.key).relative_to(s3_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, str(target))
Matthew Cox
3

You can call the AWS CLI cp command from Python to download an entire folder:

import os
import subprocess

remote_folder_name = 's3://my-bucket/my-dir'
local_path = '.'
if not os.path.exists(local_path):
    os.makedirs(local_path)
subprocess.run(['aws', 's3', 'cp', remote_folder_name, local_path, '--recursive'])

Some notes regarding this solution:

  1. You should install awscli (pip install awscli) and configure it. More info here.
  2. If you don't want to re-download files that haven't changed, you can use sync instead of cp: subprocess.run(['aws', 's3', 'sync', remote_folder_name, local_path]) (see the sketch after these notes).
  3. Tested on Python 3.6. On earlier versions of Python you might need to replace subprocess.run with subprocess.call or os.system.
  4. The CLI command executed by this code is aws s3 cp s3://my-bucket/my-dir . --recursive.
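For completeness, a minimal sketch of the sync variant from note 2, using the same placeholder paths as above:

import subprocess

# 'sync' skips files that already exist locally and are unchanged
subprocess.run(['aws', 's3', 'sync', 's3://my-bucket/my-dir', '.'])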
Roman Mirochnik
  • 196
  • 1
  • 8
1

The above solutions are good and rely on the S3 Resource API.
The following solution achieves the same goal, but uses the low-level S3 client instead.
You might find it useful (I've tested it, and it works well).

import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en):
    """ Download the contents of a folder directory into a local area """

    success = True

    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket)

    for obj in all_s3_objects_gen:
        source = obj['Key']
        if source.startswith(s3_folder):
            destination = path.join(local_dir, source)
            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                s3_client.download_file(aws_bucket, source, destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print('[DEBUG] Downloading: %s --> %s' % (source, destination))

    return success
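A hypothetical example call; every value below is a placeholder:

ok = download_s3_folder('my-folder/', './downloads',
                        'YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY',
                        'my-bucket', debug_en=True)
print('Download complete' if ok else 'Some files failed to download')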
Shahar Gino
1

I had some problems with the version above. I modified the destination variable (to build Windows-style paths) and added a parameter to filter by file type.

import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en, datatype):
    """ Download the contents of a folder directory into a local area, keeping only files of the given type """

    success = True

    # start of the copy process
    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    # helper that lists every object in the bucket, one page at a time
    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket)

    # loop over the S3 objects
    for obj in all_s3_objects_gen:
        source = obj['Key']
        if source.startswith(s3_folder):
            # convert the key into a Windows-style local path
            destination = path.join(local_dir, source).replace('/', '\\')

            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                if destination.endswith(datatype):
                    s3_client.download_file(aws_bucket, source, destination)
                    print('Successfully copied file "%s"' % destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print(f"[DEBUG] Downloading: {source} --> {destination}")

    return success
1

Here I've written a script to download files with a given extension (.csv in the code); you can change the extension according to the type of files you need to download.

import boto3
import os
import shutil

session = boto3.Session(
    aws_access_key_id='',      # fill in your credentials, or drop these
    aws_secret_access_key='',  # arguments to use the default credential chain
)


def download_directory(bucket_name, s3_folder_name):
    s3_resource = session.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    objs = list(bucket.objects.filter(Prefix=s3_folder_name))
    for obj in objs:
        print("Try to Downloading " + obj.key)
        if not os.path.exists(os.path.dirname(obj.key)):
            os.makedirs(os.path.dirname(obj.key))
        out_name = obj.key.split('/')[-1]
        if out_name[-4:] == ".csv":
            bucket.download_file(obj.key, out_name)
            print(f"Downloaded {out_name}")
            dest_path = ('/').join(obj.key.split('/')[0:-1])
            shutil.move(out_name, dest_path)
            print(f"Moved File to {dest_path}")
        else:
            print(f"Skipping {out_name}")


download_directory("mybucket", "myfolder")

Please feel free to ask me for help if you can't understand what to do exactly.

0

Here is my approach, inspired by the konstantinos-katsantonis and bjc answers.

import os
import boto3
from operator import attrgetter
from pathlib import Path

def download_s3_dir(bucketName, remote_dir, local_dir):
    assert remote_dir.endswith('/')
    assert local_dir.endswith('/')
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName) 
    objs = bucket.objects.filter(Prefix=remote_dir)
    sorted_objs = sorted(objs, key=attrgetter("key"))
    for obj in sorted_objs:
        path = Path(os.path.dirname(local_dir + obj.key))
        path.mkdir(parents=True, exist_ok=True)
        if not obj.key.endswith("/"):
            bucket.download_file(obj.key, str(path) + "/" + os.path.split(obj.key)[1])
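A minimal example call (names are placeholders); note that both directory arguments must end with a '/' to satisfy the assertions:

download_s3_dir('my-bucket', 'path/to/folder/', './downloads/')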
Greg7000
  • It does not work for me. I get an `AssertionError` from the `assert local_dir.endswith('/')` line in `download_s3_dir`. – user88484 Jul 28 '22 at 13:50
  • @user88484, make sure your remote_dir and local_dir end with a '/' – Greg7000 Aug 01 '22 at 11:28