127

I'm using boto3 to download files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket contains only files. If a folder is present inside the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the proper way to download a complete S3 bucket using boto3? How can I download folders?

John Rotenstein
Shan

18 Answers

87

I had the same need and created the following function, which downloads the files recursively.

The directories are created locally only if they contain files.

import boto3
import os

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            if not file.get('Key').endswith('/'):
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

The function is called like this:

def _start():
    client = boto3.client('s3')
    resource = boto3.resource('s3')
    download_dir(client, resource, 'clientconf/', '/tmp', bucket='my-bucket')
ignoring_gravity
glefait
  • Seems a nice way than my current one. I will try this. Thanks @glefait – Shan Oct 26 '15 at 18:06
  • 1
    This seems also the recommended way of Amazon as abstractly stated [here](http://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html). – lony Apr 26 '16 at 08:51
  • I have a manual created folder which sadly is shown as a "Key" instead of just a prefix. seems @Shan method has to be combined with this one. – lony Apr 26 '16 at 09:16
  • 9
    I don't think you need to create a resource and a client. I believe a client is always available on the resource. You can just use `resource.meta.client`. – theherk Jul 05 '16 at 21:25
  • 2
    I think that should be "download_dir(client, resource, subdir.get('Prefix'), local, **bucket**)" – rm999 Oct 26 '16 at 19:40
  • 7
    I was getting an `OSError: [Errno 21] Is a directory` so i wrapped the call to download_file with `if not file.get('Key').endswith('/')` to resolve. Thank you @glefait and @Shan – user336828 Apr 20 '17 at 23:34
  • 6
    Isn't there an equivalent of aws-cli command `aws s3 sync` available in boto3 library? – greperror Jul 07 '17 at 16:34
  • @greperror this seems like the stupidest oversight by the S3 team. – W4t3randWind Apr 13 '18 at 14:32
  • 11
    What is `dist` here? – Rob Rose Jul 10 '18 at 01:59
  • 2
    Wouldn't it be better to call get_paginator once, and pass the paginator into download_dir() ? – Chris Card Jul 20 '18 at 11:42
  • @RobRose it's the path to your folder in S3 – ignoring_gravity Jan 26 '22 at 15:42
76

When working with buckets that have 1000+ objects, it's necessary to implement a solution that uses the NextContinuationToken to page through sequential sets of at most 1000 keys. This solution first compiles a list of objects, then iteratively creates the specified directories and downloads the existing objects.

import boto3
import os

s3_client = boto3.client('s3')

def download_dir(prefix, local, bucket, client=s3_client):
    """
    params:
    - prefix: pattern to match in s3
    - local: local path to folder in which to place files
    - bucket: s3 bucket with target contents
    - client: initialized s3 client object
    """
    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket':bucket,
        'Prefix':prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents', [])  # empty list when the prefix matches nothing
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    for k in keys:
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        client.download_file(bucket, k, dest_pathname)
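
For example, using the bucket and folder names from the question (these are just placeholders), it could be called like this:

download_dir('my_folder/', '/tmp', 'my_bucket_name')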
Grant Langseth
  • changing this to accepted answer as it handle wider use case. Thanks Grant – Shan May 25 '19 at 04:48
  • 2
    my code goes into an infinite loop at `while next_token is not None:` – gpd Aug 12 '19 at 11:10
  • 1
    @gpd this shouldn’t happen as the boto3 client will return a page without the NextContinuationToken when it has reached the last page, exiting the while statement. If you paste the last response you get from using the boto3 API (whatever is stored in the response variable) then I think it will be more clear what is happening in your specific case. Try printing out the ‘results’ variable just to test. My guess is that you have given a prefix object that doesn’t match any contents of your bucket. Did you check that? – Grant Langseth Aug 12 '19 at 14:23
  • 1
    Note that you would need minor changes to make it work with Digital Ocean. as explained [here](https://www.digitalocean.com/community/questions/can-t-list-all-keys-on-spaces-no-continuation-token) – David Dahan Dec 12 '19 at 18:03
  • 3
    Using this code I am getting this error: 'NoneType' object is not iterable: TypeError – NJones Aug 04 '20 at 08:36
  • Which was the case because my Prefix was wrong! Thx this works great ;) – NJones Aug 04 '20 at 08:40
  • Before creating the directories and downloading the files add the following two lines dirs.reverse() keys.reverse() End of the while loop – sherin Nov 20 '20 at 11:54
  • Any reason for reverse order as opposed to a forward sort or no sort at all? @sherin – Grant Langseth Nov 20 '20 at 15:16
  • i have seen an issue with DOT file name and folder names and same. – sherin Nov 23 '20 at 03:47
  • 2
    Great post - but it would be clearer to split this into the trickier listing of all files and the actual rather simple downloading. – gebbissimo Jun 05 '22 at 17:49
  • I got stuck in an infinite loop until I replace while token is not None with while token – Sven Aug 13 '23 at 12:02
64
import os
import boto3

#initiate s3 resource
s3 = boto3.resource('s3')

# select bucket
my_bucket = s3.Bucket('my_bucket_name')

# download file into current directory
for s3_object in my_bucket.objects.all():
    # Split s3_object.key into path and file name; downloading to just the file name avoids a 'file not found' error for nested keys.
    path, filename = os.path.split(s3_object.key)
    my_bucket.download_file(s3_object.key, filename)
Joe Haddad
Tushar Niras
  • 4
    Clean and simple, any reason why not to use this? It's much more understandable than all the other solutions. Collections seem to do a lot of things for you in the background. – Joost Oct 05 '17 at 14:35
  • 4
    I guess you should first create all subfolders in order to have this working properly. – rpanai Mar 02 '18 at 13:48
  • 11
    This code will put everything in the top-level output directory regardless of how deeply nested it is in S3. And if multiple files have the same name in different directories, it will stomp on one with another. I think you need one more line: `os.makedirs(path)`, and then the download destination should be `object.key`. – Scott Smith Jan 07 '19 at 07:16
  • This is the easier to read solution! FYI: I worried that it might only read the first 1000 objects, but it seems to really get all. – gebbissimo Jun 05 '22 at 17:57
  • 1
    @TusharNiras: Could you add `my_bucket.objects.filter(Prefix="").all()` to only download files with a certain prefix? – gebbissimo Jun 05 '22 at 18:03
55

Amazon S3 does not have folders/directories. It is a flat file structure.

To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

  • images/foo.jpg

In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that it could be multi-level nested directories.
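
A minimal sketch of the second option (creating the needed directories before each download), building on the code in the question; the bucket name is a placeholder:

import os
import boto3

s3 = boto3.client('s3')
for obj in s3.list_objects(Bucket='my_bucket_name')['Contents']:
    local_path = obj['Key']
    directory = os.path.dirname(local_path)
    if directory and not os.path.exists(directory):
        os.makedirs(directory)            # handles multi-level nesting
    if not local_path.endswith('/'):      # skip "folder" placeholder objects
        s3.download_file('my_bucket_name', obj['Key'], local_path)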

An easier way would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, eg:

aws s3 cp --recursive s3://my_bucket_name local_folder

There's also a sync option that will only copy new and modified files.
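
For example (using the same placeholder bucket and local folder):

aws s3 sync s3://my_bucket_name local_folder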

Alexis Wilke
John Rotenstein
  • 1
    @j I understand that. But i needed the folder to be created, automatically just like `aws s3 sync`. Is it possible in boto3. – Shan Aug 11 '15 at 07:54
  • 5
    You would have to include the creation of a directory as part of your Python code. If the Key contains a directory (eg `foo/bar.txt`), you would be responsible for creating the directory (`foo`) before calling `s3.download_file`. It is not an automatic capability of `boto`. – John Rotenstein Aug 11 '15 at 12:40
  • Here, the content of the S3 bucket is dynamic, so i have to check `s3.list_objects(Bucket='my_bucket_name')['Contents']` and filter for folder keys and create them. – Shan Aug 12 '15 at 07:08
  • 3
    After playing around with Boto3 for a while, AWS CLI command listed here is definitely the easiest way to do this. – AdjunctProfessorFalcon Sep 10 '16 at 19:44
  • Maybe a stupid question, but what is the ".8Df54234" part? It keeps appending something like that to my local file name and throwing an error. – Ben Jun 07 '18 at 14:04
  • 1
    @Ben Please start a new Question rather than asking a question as a comment on an old (2015) question. – John Rotenstein Jun 07 '18 at 20:42
17

I'm currently achieving the task by using the following:

#!/usr/bin/python
import os
import boto3

s3 = boto3.client('s3')
objects = s3.list_objects(Bucket='bucket')['Contents']
for s3_key in objects:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3.download_file('bucket', s3_object, s3_object)
    else:
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)

Although it does the job, I'm not sure it's a good way to do this. I'm leaving it here to help other users and to invite further answers with a better way of achieving it.

Shan
13

Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you might end up hitting Python's recursion limit. Here's an alternative approach, with a couple of extra checks.

import os
import errno
import boto3


def assert_dir_exists(path):
    """
    Checks if the directory tree in path exists. If not, it creates it.
    :param path: the path to check if it exists
    """
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


def download_dir(client, bucket, path, target):
    """
    Recursively downloads the given S3 path to the target directory.
    :param client: S3 client to use.
    :param bucket: the name of the bucket to download from
    :param path: The S3 directory to download.
    :param target: the local directory to download the files to.
    """

    # Handle missing / at end of prefix
    if not path.endswith('/'):
        path += '/'

    paginator = client.get_paginator('list_objects_v2')
    for result in paginator.paginate(Bucket=bucket, Prefix=path):
        # Download each file individually
        for key in result.get('Contents', []):
            # Calculate relative path
            rel_path = key['Key'][len(path):]
            # Skip paths ending in /
            if not key['Key'].endswith('/'):
                local_file_path = os.path.join(target, rel_path)
                # Make sure directories exist
                local_file_dir = os.path.dirname(local_file_path)
                assert_dir_exists(local_file_dir)
                client.download_file(bucket, key['Key'], local_file_path)


client = boto3.client('s3')

download_dir(client, 'bucket-name', 'path/to/data', 'downloads')
ifoukarakis
  • 1
    Got `KeyError: 'Contents'` . input path `'/arch/R/storeincomelogs/` , full path `/arch/R/storeincomelogs/201901/01/xxx.parquet` . – Mithril Jan 29 '19 at 03:41
  • > Got KeyError: 'Contents' `Contents` will not be present when the provided prefix/path does not have any files. Adding `if 'Contents' not in result: continue` should solve the problem but I would check the use-case prior to making that change. – sant parkash singh Nov 03 '21 at 00:09
6

A lot of the solutions here get pretty complicated. If you're looking for something simpler, cloudpathlib wraps things up nicely for this use case; it will download directories or files.

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/product/myproject/2021-02-15/")
cp.download_to("local_folder")

Note: for large folders with lots of files, awscli at the command line is likely faster.
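
For reference, the rough CLI equivalent of the snippet above (same placeholder path) would be:

aws s3 sync s3://bucket/product/myproject/2021-02-15/ local_folder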

hume
  • 3
    This was really sweet and simple. Just to complete this answer. install cloudpathlib `pip install cloudpathlib[s3]` – Samual Oct 03 '21 at 16:37
4

I have a workaround for this that runs the AWS CLI in the same process.

Install awscli as a Python library:

pip install awscli

Then define this function:

import os
from awscli.clidriver import create_clidriver

def aws_cli(*cmd):
    old_env = dict(os.environ)
    try:

        # Environment
        env = os.environ.copy()
        env['LC_CTYPE'] = u'en_US.UTF'
        os.environ.update(env)

        # Run awscli in the same process
        exit_code = create_clidriver().main(list(cmd))

        # Deal with problems
        if exit_code > 0:
            raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
    finally:
        os.environ.clear()
        os.environ.update(old_env)

To execute:

aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')
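
For the download case from the question, the same helper can be run in the other direction (bucket and local path are placeholders):

aws_cli('s3', 'sync', 's3://my_bucket_name', '/path/to/destination')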
mattalxndr
  • I used the same idea but without using the `sync` command, and rather simply executing the command `aws s3 cp s3://{bucket}/{folder} {local_folder} --recursive`. Times reduced from minutes (almost 1h) to literally seconds – acaruci Nov 08 '19 at 20:00
  • I'm using this code but have an issue where all the debug logs are showing. I have this declared globally: `logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.WARNING) logger = logging.getLogger()` and only want logs to be outputted from root. Any ideas? – April Polubiec Dec 05 '19 at 19:36
2

I've updated Grant's answer to run in parallel; it's much faster, if anyone is interested:

from concurrent import futures
import os
import boto3

def download_dir(prefix, local, bucket):

    client = boto3.client('s3')

    def create_folder_and_download_file(k):
        dest_pathname = os.path.join(local, k)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
        print(f'downloading {k} to {dest_pathname}')
        client.download_file(bucket, k, dest_pathname)

    keys = []
    dirs = []
    next_token = ''
    base_kwargs = {
        'Bucket': bucket,
        'Prefix': prefix,
    }
    while next_token is not None:
        kwargs = base_kwargs.copy()
        if next_token != '':
            kwargs.update({'ContinuationToken': next_token})
        results = client.list_objects_v2(**kwargs)
        contents = results.get('Contents', [])  # empty list when the prefix matches nothing
        for i in contents:
            k = i.get('Key')
            if k[-1] != '/':
                keys.append(k)
            else:
                dirs.append(k)
        next_token = results.get('NextContinuationToken')
    for d in dirs:
        dest_pathname = os.path.join(local, d)
        if not os.path.exists(os.path.dirname(dest_pathname)):
            os.makedirs(os.path.dirname(dest_pathname))
    with futures.ThreadPoolExecutor() as executor:
        futures.wait(
            [executor.submit(create_folder_and_download_file, k) for k in keys],
            return_when=futures.FIRST_EXCEPTION,
        )
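
It can be called the same way as Grant's version, e.g. with placeholder names:

download_dir('my_folder/', '/tmp', 'my_bucket_name')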
Utkarsh Dalal
2

Yet another parallel downloader, this one using asyncio/aiobotocore:

import os
import time
import asyncio
import logging
from itertools import chain
from functools import partial
from typing import List

# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config

_NUM_WORKERS = 50


bucket_name= 'test-data'
bucket_prefix= 'etl2/test/20210330/f_api'


async def save_to_file(s3_client, bucket: str, key: str):
    response = await s3_client.get_object(Bucket=bucket, Key=key)
    async with response['Body'] as stream:
        content = await stream.read()

    # write the object under out/downloaded/<bucket>/<key>, creating directories as needed
    fn = f'out/downloaded/{bucket_name}/{key}'
    os.makedirs(os.path.dirname(fn), exist_ok=True)
    with open(fn, 'wb') as fh:
        fh.write(content)
        print(f'Downloaded to: {fn}')

    return [0]


async def go(bucket: str, prefix: str) -> List[dict]:
    """
    Returns list of dicts of object contents

    :param bucket: s3 bucket
    :param prefix: s3 bucket prefix
    :return: list of download statuses
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    session = aiobotocore.session.AioSession()
    config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
    contents = []
    async with session.create_client('s3', config=config) as client:
        worker_co = partial(save_to_file, client, bucket)
        async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
                                       return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
            # list s3 objects using paginator
            paginator = client.get_paginator('list_objects')
            async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for c in result.get('Contents', []):
                    contents.append(await work_pool.push(c['Key'], client))

    # retrieve results from futures
    contents = [c.result() for c in contents]
    return list(chain.from_iterable(contents))


def S3_download_bucket_files():
    s = time.perf_counter()
    _loop = asyncio.get_event_loop()
    _result = _loop.run_until_complete(go(bucket_name, bucket_prefix))
    assert sum(_result)==0, _result
    print(_result)
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")

It first fetches the list of files from S3 and then downloads them with aiobotocore, using `_NUM_WORKERS = 50` connections to read data from the network in parallel.
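
A minimal way to run it, assuming the snippet above is saved as a script:

if __name__ == '__main__':
    S3_download_bucket_files()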

Alex B
1

It is a very bad idea to get all files in one go; you should rather fetch them in batches.

One implementation I use to fetch a particular folder (directory) from S3 is:

from boto3.session import Session

def get_directory(directory_path, download_path, exclude_file_names):
    # prepare session (credentials and bucket_name are defined elsewhere)
    session = Session(aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key,
                      region_name=region_name)

    # get instances for resource, bucket and client
    resource = session.resource('s3')
    bucket = resource.Bucket(bucket_name)
    client = session.client('s3')

    for s3_key in client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
        s3_object = s3_key['Key']
        if s3_object not in exclude_file_names:
            bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))

And if you still want to get the whole bucket, use the CLI as @John Rotenstein mentioned below:

aws s3 cp --recursive s3://bucket_name download_path
shiva
Ganatra
1

If you want to call a bash script using Python, here is a simple method to load files from a folder in an S3 bucket into a local folder (on a Linux machine):

import boto3
import subprocess
import os

###TOEDIT###
my_bucket_name = "your_my_bucket_name"
bucket_folder_name = "your_bucket_folder_name"
local_folder_path = "your_local_folder_path"
###TOEDIT###

# 1. Load the list of files existing in the bucket folder
FILES_NAMES = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('{}'.format(my_bucket_name))
for object_summary in my_bucket.objects.filter(Prefix="{}/".format(bucket_folder_name)):
    # keep only the file name so it can be compared against os.listdir() below
    FILES_NAMES.append(object_summary.key.split('/')[-1])

# 2. List only new files that do not exist in the local folder (to not copy everything!)
new_filenames = list(set(FILES_NAMES) - set(os.listdir(local_folder_path)))

# 3.Time to load files in your destination folder 
for new_filename in new_filenames:
    upload_S3files_CMD = """aws s3 cp s3://{}/{}/{} {}""".format(my_bucket_name,bucket_folder_name,new_filename ,local_folder_path)

    subprocess_call = subprocess.call([upload_S3files_CMD], shell=True)
    if subprocess_call != 0:
        print("ALERT: loading files not working correctly, please re-check new loaded files")
HazimoRa3d
1

From AWS S3 Docs (How do I use folders in an S3 bucket?):

In Amazon S3, buckets and objects are the primary resources, and objects are stored in buckets. Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using a shared name prefix for objects (that is, objects have names that begin with a common string). Object names are also referred to as key names.

For example, you can create a folder on the console named photos and store an object named myphoto.jpg in it. The object is then stored with the key name photos/myphoto.jpg, where photos/ is the prefix.

To download all files from "mybucket" into the current directory respecting the bucket's emulated directory structure (creating the folders from the bucket if they don't already exist locally):

import boto3
import os

bucket_name = "mybucket"
s3 = boto3.client("s3")
objects = s3.list_objects(Bucket = bucket_name)["Contents"]
for s3_object in objects:
    s3_key = s3_object["Key"]
    path, filename = os.path.split(s3_key)
    if len(path) != 0 and not os.path.exists(path):
        os.makedirs(path)
    if not s3_key.endswith("/"):
        download_to = path + '/' + filename if path else filename
        s3.download_file(bucket_name, s3_key, download_to)
Daria
0
import os
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')

for objs in my_bucket.objects.all():
    print(objs.key)
    # S3 keys always use '/' as the separator, regardless of the local OS
    path = '/tmp/' + '/'.join(objs.key.split('/')[:-1])
    try:
        if not os.path.exists(path):
            os.makedirs(path)
        my_bucket.download_file(objs.key, '/tmp/' + objs.key)
    except FileExistsError as fe:
        print(objs.key + ' exists')

This code will download the content into the /tmp/ directory. You can change the directory if you want.

Rajesh Rajendran
0

I had a similar requirement and got help from reading a few of the solutions above and on other websites. I came up with the script below; just wanted to share it in case it helps anyone.

from boto3.session import Session
import os

def sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path):    
    session = Session(aws_access_key_id=access_key_id,aws_secret_access_key=secret_access_key)
    s3 = session.resource('s3')
    your_bucket = s3.Bucket(bucket_name)
    for s3_file in your_bucket.objects.all():
        if folder in s3_file.key:
            file=os.path.join(destination_path,s3_file.key.replace('/','\\'))
            if not os.path.exists(os.path.dirname(file)):
                os.makedirs(os.path.dirname(file))
            your_bucket.download_file(s3_file.key,file)
sync_s3_folder(access_key_id,secret_access_key,bucket_name,folder,destination_path)
Kranti
0

Reposting @glefait's answer with an if condition at the end to avoid OS error 20. The first key it gets is the folder name itself, which cannot be written to the destination path.

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            # skip the key that is the folder (prefix) itself
            if not file.get('Key') == dist:
                resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)
vinay
0

I had been running into this problem for a while, and across all of the different forums I've been through I hadn't seen a full end-to-end snippet of what works. So I went ahead, took all the pieces (and added some of my own), and created a full end-to-end S3 downloader!

This will not only download files automatically, but if the S3 files are in subdirectories, it will also create them on the local storage. In my application's case, I need to set permissions and owners, so I have added that too (it can be commented out if not needed).

This has been tested and works in a Docker environment (K8s), but I have added the environment variables in the script in case you want to test/run it locally.

I hope this helps someone out in their quest for S3 download automation. I also welcome any advice, info, etc. on how this can be better optimized.

#!/usr/bin/python3
import gc
import logging
import os
import signal
import sys
import time
from datetime import datetime

import boto
from boto.exception import S3ResponseError
from pythonjsonlogger import jsonlogger

formatter = jsonlogger.JsonFormatter('%(message)%(levelname)%(name)%(asctime)%(filename)%(lineno)%(funcName)')

json_handler_out = logging.StreamHandler()
json_handler_out.setFormatter(formatter)

#Manual Testing Variables If Needed
#os.environ["DOWNLOAD_LOCATION_PATH"] = "some_path"
#os.environ["BUCKET_NAME"] = "some_bucket"
#os.environ["AWS_ACCESS_KEY"] = "some_access_key"
#os.environ["AWS_SECRET_KEY"] = "some_secret"
#os.environ["LOG_LEVEL_SELECTOR"] = "DEBUG, INFO, or ERROR"

#Setting Log Level Test
logger = logging.getLogger('json')
logger.addHandler(json_handler_out)
logger_levels = {
    'ERROR' : logging.ERROR,
    'INFO' : logging.INFO,
    'DEBUG' : logging.DEBUG
}
logger_level_selector = os.environ["LOG_LEVEL_SELECTOR"]
logger.setLevel(logger_level_selector)

#Getting Date/Time
now = datetime.now()
logger.info("Current date and time : ")
logger.info(now.strftime("%Y-%m-%d %H:%M:%S"))

#Establishing S3 Variables and Download Location
download_location_path = os.environ["DOWNLOAD_LOCATION_PATH"]
bucket_name = os.environ["BUCKET_NAME"]
aws_access_key_id = os.environ["AWS_ACCESS_KEY"]
aws_access_secret_key = os.environ["AWS_SECRET_KEY"]
logger.debug("Bucket: %s" % bucket_name)
logger.debug("Key: %s" % aws_access_key_id)
logger.debug("Secret: %s" % aws_access_secret_key)
logger.debug("Download location path: %s" % download_location_path)

#Creating Download Directory
if not os.path.exists(download_location_path):
    logger.info("Making download directory")
    os.makedirs(download_location_path)

#Signal Hooks are fun
class GracefulKiller:
    kill_now = False
    def __init__(self):
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)
    def exit_gracefully(self, signum, frame):
        self.kill_now = True

#Downloading from S3 Bucket
def download_s3_bucket():
    conn = boto.connect_s3(aws_access_key_id, aws_access_secret_key)
    logger.debug("Connection established: ")
    bucket = conn.get_bucket(bucket_name)
    logger.debug("Bucket: %s" % str(bucket))
    bucket_list = bucket.list()
#    logger.info("Number of items to download: {0}".format(len(bucket_list)))

    for s3_item in bucket_list:
        key_string = str(s3_item.key)
        logger.debug("S3 Bucket Item to download: %s" % key_string)
        s3_path = download_location_path + "/" + key_string
        logger.debug("Downloading to: %s" % s3_path)
        local_dir = os.path.dirname(s3_path)

        if not os.path.exists(local_dir):
            logger.info("Local directory doesn't exist, creating it... %s" % local_dir)
            os.makedirs(local_dir)
            logger.info("Updating local directory permissions to %s" % local_dir)
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(local_dir, 0o775)
            os.chown(local_dir, 60001, 60001)
        logger.debug("Local directory for download: %s" % local_dir)
        try:
            logger.info("Downloading File: %s" % key_string)
            s3_item.get_contents_to_filename(s3_path)
            logger.info("Successfully downloaded File: %s" % s3_path)
            #Updating Permissions
            logger.info("Updating Permissions for %s" % str(s3_path))
#Comment or Uncomment Permissions based on Local Usage
            os.chmod(s3_path, 0o664)
            os.chown(s3_path, 60001, 60001)
        except (OSError, S3ResponseError) as e:
            logger.error("Fatal error in s3_item.get_contents_to_filename", exc_info=True)
            # logger.error("Exception in file download from S3: {}".format(e))
            continue
        logger.info("Deleting %s from S3 Bucket" % str(s3_item.key))
        s3_item.delete()

def main():
    killer = GracefulKiller()
    while not killer.kill_now:
        logger.info("Checking for new files on S3 to download...")
        download_s3_bucket()
        logger.info("Done checking for new files, will check in 120s...")
        gc.collect()
        sys.stdout.flush()
        time.sleep(120)
if __name__ == '__main__':
    main()
Comrade35
0

There are minor differences between the way S3 organizes files and the way Windows does. Here is a simple, self-documenting example that accounts for those differences.

Also: think of Amazon file names as normal strings. They don't really represent a folder. Amazon SIMULATES folders, so if you try to just shove a file into the NAME of a folder that doesn't exist on your system, it cannot figure out where to place it. So you must MAKE a folder on your system for each simulated folder from S3. If you have a folder within a folder, don't use "mkdir(path)"; it won't work. You have to use "makedirs(path)". ANOTHER THING! -> PC file paths are weirdly formatted. Amazon uses "/" and the PC uses "\" and it MUST be consistent for the whole file name. Check out my code block below (WHICH SHOWS AUTHENTICATION TOO).

In my example, I have a folder in my bucket called "iTovenGUIImages/gui_media". I want to put it in a folder on my system that MAY not exist yet. The folder on my system has its own special prefix that can be whatever you want in your system, as long as it's formatted like a folder path.

import boto3
import cred
import os

locale_file_Imagedirectory = r"C:\\Temp\\App Data\\iToven AI\\"  # This is where all GUI files for iToven AI exist on PC


def downloadImageDirectoryS3(remoteDirectoryName, desired_parent_folder):
    my_bucket = 'itovenbucket'
    s3_resource = boto3.resource('s3', aws_access_key_id=cred.AWSAccessKeyId,
                                 aws_secret_access_key=cred.AWSSecretKey)
    bucket = s3_resource.Bucket(my_bucket)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        pcVersionPrefix = remoteDirectoryName.replace("/", r"\\")
        isolatedFileName = obj.key.replace(remoteDirectoryName, "")
        clientSideFileName = desired_parent_folder+pcVersionPrefix+isolatedFileName
        print(clientSideFileName)  # Client-Side System File Structure
        if not os.path.exists(desired_parent_folder+pcVersionPrefix):  # CREATE DIRECTORIES FOR EACH FOLDER RECURSIVELY
            os.makedirs(desired_parent_folder+pcVersionPrefix)
        if not obj.key.endswith("/"):  # skip the "folder" placeholder object itself
            bucket.download_file(obj.key, clientSideFileName)  # save to new path


downloadImageDirectoryS3(r"iTovenGUIImagesPC/gui_media/", locale_file_Imagedirectory)
Arthur Lee