26

I am trying to download a text file from S3 using boto3.

Here is what I have written.

import os
import sys
import threading

class ProgressPercentage(object):
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        # To simplify we'll assume this is hooked up
        # to a single filename.
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = round((self._seen_so_far / self._size) * 100, 2)
            # LoggingFile is my own logging helper
            LoggingFile('{} is the file name. {} out of {} done. The percentage completed is {} %'.format(
                str(self._filename), str(self._seen_so_far), str(self._size), str(percentage)))
            sys.stdout.flush()

and I am calling it using

transfer.download_file(BUCKET_NAME, FILE_NAME, '{}{}'.format(LOCAL_PATH_TEMP, FILE_NAME), callback=ProgressPercentage(LOCAL_PATH_TEMP + FILE_NAME))

This gives me an error that the file is not present in the folder. Apparently, when I already have a file with this name in the same folder it works, but when I am downloading a fresh file, it errors out.

What correction do I need to make?

Kshitij Marwah

10 Answers

20

This is my implementation. No other dependencies; hack up the progress callback function to display whatever you want.

import sys
import boto3

s3_client = boto3.client('s3')

def download(local_file_name, s3_bucket, s3_object_key):
    meta_data = s3_client.head_object(Bucket=s3_bucket, Key=s3_object_key)
    total_length = int(meta_data.get('ContentLength', 0))
    downloaded = 0

    def progress(chunk):
        nonlocal downloaded
        downloaded += chunk
        done = int(50 * downloaded / total_length)
        sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50-done)) )
        sys.stdout.flush()

    print(f'Downloading {s3_object_key}')
    with open(local_file_name, 'wb') as f:
        s3_client.download_fileobj(s3_bucket, s3_object_key, f, Callback=progress)

e.g.

local_file_name = 'test.csv'
s3_bucket = 'my-bucket'
s3_object_key = 'industry/test.csv'

download(local_file_name, s3_bucket, s3_object_key)


Tested with boto3>=1.14.19, python>=3.7

Glen Thompson
17

callback=ProgressPercentage(LOCAL_PATH_TEMP + FILE_NAME) creates a ProgressPercentage object, runs its __init__ method, and passes the object as callback to the download_file method. This means the __init__ method runs before download_file begins.

In the __init__ method you are attempting to read the size of the local file being downloaded to, which throws an exception because the file does not exist, since the download has yet to start. If you've already downloaded the file, then there's no problem since a local copy exists and its size can be read.

Of course, this is merely the cause of the exception you're seeing. You're using the _size property as the maximum value of download progress. However, you're attempting to use the size of the local file. Until the file is completely downloaded, the local file system does not know how large the file is; it only knows how much space it takes up right now. This means that, as you download, the file will gradually get bigger until it reaches its full size. As such, it doesn't really make sense to consider the size of the local file as the maximum size of the download. It may work in the case where you've already downloaded the file, but that isn't very useful.

The solution to your problem would be to check the size of the file you're going to download, instead of the size of the local copy. This ensures you're getting the actual size of whatever it is you're downloading, and that the file exists (as you couldn't be downloading it if it didn't). You can do this by getting the size of the remote file with head_object, as follows:

class ProgressPercentage(object):
    def __init__(self, client, bucket, filename):
        # ... everything else the same
        self._size = client.head_object(Bucket=bucket, Key=filename).ContentLength

    # ...

# If you still have the client object you could pass that directly 
# instead of transfer._manager._client
progress = ProgressPercentage(transfer._manager._client, BUCKET_NAME, FILE_NAME)
transfer.download_file(..., callback=progress)

As a final note, although you got the code from the Boto3 documentation, it didn't work because it was intended for file uploads. In that case the local file is the source, so its existence is guaranteed.
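
For comparison, here is roughly how that documentation snippet is meant to be used, in the upload direction, where reading the local size up front is valid because the source file already exists. This is a sketch; the bucket and file names are placeholders:

import os
import sys
import threading

import boto3

class UploadProgressPercentage(object):
    """Progress callback for uploads: the local file is the source,
    so os.path.getsize() is safe here."""
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))  # file exists before upload
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write("\r%s  %s / %s  (%.2f%%)" % (
                self._filename, self._seen_so_far, self._size, percentage))
            sys.stdout.flush()

boto3.client('s3').upload_file('local.txt', 'my-bucket', 'remote.txt',
                               Callback=UploadProgressPercentage('local.txt'))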

yummies
  • So... idk if it's just me, but in the docs for `1.9.96` the named argument is `callback` with a lowercase `c`. But in the code of the same version (downloaded via pip) I got a capital C for this exact same argument /: me = confused. I'll post my code as an example below. – Boop Feb 18 '19 at 10:35
  • Great, this works for me! I only had to make one minor change: head_object returns a dictionary, so use `client.head_object(Bucket=bucket, Key=filename).get('ContentLength')` – Gustavo_fringe Mar 22 '19 at 17:48
  • How would you go about displaying the progress of said upload or download as it is occurring? This only shows how to get the download percentage on a specific call. – ViaTech Jul 17 '19 at 00:11
  • This is not an MVP. I do not understand how to use this code. `transfer` isn't defined either. – Naz Aug 19 '19 at 20:19
  • I made a small change to the code and now it works! `client.Object(bucket, filename).get()['ContentLength']`, where my "client" here is a `boto3 resource object`. I need this change since I create the boto3 session object with accessKey/accessSecretKey. – appletabo May 12 '21 at 07:21
15

Here's another simple custom implementation using tqdm:

from tqdm import tqdm
import boto3

def s3_download(s3_bucket, s3_object_key, local_file_name, s3_client=boto3.client('s3')):
    meta_data = s3_client.head_object(Bucket=s3_bucket, Key=s3_object_key)
    total_length = int(meta_data.get('ContentLength', 0))
    with tqdm(total=total_length, desc=f'source: s3://{s3_bucket}/{s3_object_key}',
              bar_format="{percentage:.1f}%|{bar:25} | {rate_fmt} | {desc}",
              unit='B', unit_scale=True, unit_divisor=1024) as pbar:
        with open(local_file_name, 'wb') as f:
            s3_client.download_fileobj(s3_bucket, s3_object_key, f, Callback=pbar.update)

usage:

s3_download(bucket, key, local_file_name)

output:

100.0%|█████████████████████████ | 12.9MB/s | source: s3://bucket/key
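
The same pattern works in the upload direction too. A minimal sketch (the function and variable names here are mine), using os.path.getsize since the local file is the source:

import os

import boto3
from tqdm import tqdm

def s3_upload(local_file_name, s3_bucket, s3_object_key, s3_client=boto3.client('s3')):
    total_length = os.path.getsize(local_file_name)  # local file exists for uploads
    with tqdm(total=total_length, desc=f'dest: s3://{s3_bucket}/{s3_object_key}',
              unit='B', unit_scale=True, unit_divisor=1024) as pbar:
        with open(local_file_name, 'rb') as f:
            s3_client.upload_fileobj(f, s3_bucket, s3_object_key, Callback=pbar.update)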
marescab
13

Install progressbar with pip3 install progressbar

import boto3, os
import progressbar

bucket_name = "<your-s3-bucket-name>"
folder_name = "<your-directory-name-locally>"
file_name = "<your-filename-locally>"
path = folder_name + "/" + file_name
s3 = boto3.client('s3', aws_access_key_id="<your_aws_access_key_id>", aws_secret_access_key="<your_aws_secret_access_key>")

statinfo = os.stat(file_name)

up_progress = progressbar.ProgressBar(maxval=statinfo.st_size)

up_progress.start()

def upload_progress(chunk):
    up_progress.update(up_progress.currval + chunk)

s3.upload_file(file_name, bucket_name, path, Callback=upload_progress)

up_progress.finish()
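
Since the question asks about downloads, where the local file doesn't exist yet, a rough adaptation (an untested sketch, reusing the names above) would take the size from head_object instead of os.stat:

down_progress = progressbar.ProgressBar(maxval=s3.head_object(Bucket=bucket_name, Key=path)["ContentLength"])
down_progress.start()

def download_progress(chunk):
    down_progress.update(down_progress.currval + chunk)

s3.download_file(bucket_name, path, file_name, Callback=download_progress)
down_progress.finish()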
Adam Kurkiewicz
3

Following the official documentation, it is not difficult to add progress tracking (the download_file and upload_file functions are similar). Here is the full code, with some modifications to display the data size in a human-readable format.

import logging
import boto3
from botocore.exceptions import ClientError
import os
import sys
import threading
import math 

ACCESS_KEY = 'xxx'
SECRET_KEY = 'xxx'
REGION_NAME= 'ap-southeast-1'

class ProgressPercentage(object):
    def __init__(self, filename, filesize):
        self._filename = filename
        self._size = filesize
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        def convertSize(size):
            if (size == 0):
                return '0B'
            size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
            i = int(math.floor(math.log(size,1024)))
            p = math.pow(1024,i)
            s = round(size/p,2)
            return '%.2f %s' % (s,size_name[i])

        # To simplify, assume this is hooked up to a single filename
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)        " % (
                    self._filename, convertSize(self._seen_so_far), convertSize(self._size),
                    percentage))
            sys.stdout.flush()


def download_file(file_name, object_name, bucket_name):
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Initialize s3 client
    s3_client = boto3.client(service_name="s3",
                aws_access_key_id=ACCESS_KEY,
                aws_secret_access_key=SECRET_KEY,
                region_name=REGION_NAME)
    try:
        response = s3_client.download_file(
            Bucket=bucket_name, 
            Key=object_name, 
            Filename=file_name,
            Callback=ProgressPercentage(file_name, (s3_client.head_object(Bucket=bucket_name, Key=object_name))["ContentLength"])
            )
    except ClientError as e:
        logging.error(e)
        return False
    return True

file_name = "./output.csv.gz"
bucket_name = "mybucket"
object_name = "result/output.csv.gz" 
download_file(file_name, object_name, bucket_name)
Nguyen Van Duc
2

The object returned by client.head_object(Bucket=bucket, Key=filename) is a dict. The file size can be accessed using ['ContentLength'].

Hence the code:

self._size = client.head_object(Bucket=bucket, Key=filename).ContentLength

should become:

self._size = float(client.head_object(Bucket=bucket, Key=filename)['ContentLength'])

Then it works. Thanks!
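
Put together, the corrected __init__ would look like this (a sketch; client, bucket and filename as in the answer above):

import threading

class ProgressPercentage(object):
    def __init__(self, client, bucket, filename):
        self._filename = filename
        # head_object returns a dict, so index it with ['ContentLength']
        self._size = float(client.head_object(Bucket=bucket, Key=filename)['ContentLength'])
        self._seen_so_far = 0
        self._lock = threading.Lock()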

nicolas.f.g
1

Someone may stumble upon this answer when trying to do this (as per the question title). The easiest way I know to show S3 upload progress:

Import a progress bar library into your project. This is what I used: https://github.com/anler/progressbar

Then:

import os

import progressbar
import boto3

bucket = "my-bucket-name"
s3_client = boto3.resource('s3')
...
...

# get the size of the local file in bytes
filesize = os.path.getsize(file)

up_progress = progressbar.AnimatedProgressBar(end=filesize, width=50)

def upload_progress(chunk):
    up_progress + chunk  # Notice! No len()
    up_progress.show_progress()

s3_client.meta.client.upload_file(file, bucket, s3_file_name, Callback=upload_progress)

The important thing to notice here is the use of the Callback parameter (capital C). The callback is passed the number of bytes transferred with each chunk, so if you know the original filesize, some simple math gets you a progress bar. You can then use any progress bar library.
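
In other words, the idea in a bare-bones form, independent of any progress bar library (a sketch; make_percent_callback and the file/bucket/key names are mine):

import os

import boto3

def make_percent_callback(total_bytes):
    """Build a Callback that accumulates transferred bytes and prints a percentage."""
    state = {'seen': 0}
    def callback(bytes_amount):
        state['seen'] += bytes_amount
        print('\r{:.1f}% uploaded'.format(100 * state['seen'] / total_bytes),
              end='', flush=True)
    return callback

s3 = boto3.client('s3')
file, bucket, s3_file_name = 'local.txt', 'my-bucket-name', 'remote.txt'  # placeholders
s3.upload_file(file, bucket, s3_file_name, Callback=make_percent_callback(os.path.getsize(file)))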

Emmanuel N K
  • Doesn't work with the version of progressbar I've installed with pip3. – Adam Kurkiewicz Dec 18 '18 at 03:49
  • I should have mentioned I put the lib into my project directly instead of using pip3. For those wondering how: create a folder called `progressbar` and place it with the rest of your python libraries, add an empty `__init__.py` file inside it, then add the `progressbar.py` file from the github repo. You can then import it normally into your project. – Emmanuel N K Dec 19 '18 at 06:08
1

Info

  • Credits to @Kshitij Marwah, @yummies and nicolas.f.g posts
  • Using boto3 1.9.96 (dl via pip)
  • Removed threading
  • Changed display format (rewrites the line above until the download completes)
  • Posting because of the difference between the online doc and the downloaded package

Code

import sys

class ProgressPercentage(object):
    def __init__(self, o_s3bucket, key_name):
        self._key_name = key_name
        boto_client = o_s3bucket.meta.client
        # ContentLength is an int
        self._size = boto_client.head_object(Bucket=o_s3bucket.name, Key=key_name)['ContentLength']
        self._seen_so_far = 0
        sys.stdout.write('\n')

    def __call__(self, bytes_amount):
        self._seen_so_far += bytes_amount
        percentage = (float(self._seen_so_far) / float(self._size)) * 100
        TERM_UP_ONE_LINE = '\033[A'
        TERM_CLEAR_LINE = '\033[2K'
        sys.stdout.write('\r' + TERM_UP_ONE_LINE + TERM_CLEAR_LINE)
        sys.stdout.write('{} {}/{} ({}%)\n'.format(self._key_name, str(self._seen_so_far), str(self._size), str(percentage)))
        sys.stdout.flush()

Then call it like this (note the capital C on Callback, which differs from the online doc):

progress = ProgressPercentage(o_s3bucket, key_name)
o_s3bucket.download_file(key_name, full_local_path, Callback=progress)

where o_s3bucket is:

import boto3

bucket_name = 'my_bucket_name'
aws_profile = 'default'  # used to fetch credentials from the .aws/credentials ini file
boto_session = boto3.session.Session(profile_name=aws_profile)
o_s3bucket = boto_session.resource('s3').Bucket(bucket_name)

hth

Boop
  • How might `Callback` be piped to `logging` to provide a % sent to AWS? Something like `INFO: 10% of xyz.file uploaded`, `INFO: 20% of xyz.file uploaded`, etc. up to and including `INFO: xyz.file successfully uploaded`. – SeaDude Sep 19 '20 at 02:44
  • The callback gets called every time a new packet comes (or goes, for an upload), I guess. But definitely a *bunch of times*. I wouldn't advise logging the progress because there's no added value and it takes space for nothing. But you can do it of course: here in my `__call__` method you can call logging: it will log every step. Hope that covers your questions – Boop Sep 19 '20 at 13:23
  • Thanks @Boop. The added value (at least for me) comes in when the Python code is hosted as an Azure Function and data transfers are large. I'm looking to only log every 10% or the like. I'll chip away and see what I can come up with. – SeaDude Sep 19 '20 at 15:36
1

Here is an option I've found useful with the click library (just run pip install click before applying the code below):

import click
import boto3
import os


file_path = os.path.join('tmp', 'file_path')
s3_client = boto3.client('s3')  # upload_fileobj is a client method
with click.progressbar(length=os.path.getsize(file_path)) as progress_bar:
    with open(file_path, mode='rb') as upload_file:
        s3_client.upload_fileobj(
            upload_file,
            'bucket_name',
            'foo_bar',
            Callback=progress_bar.update,
        )
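
For a download, where the local size isn't known up front, the same approach can take the length from head_object instead (a sketch; the bucket, key and local file names are placeholders):

import click
import boto3

s3 = boto3.client('s3')
# ask S3 for the object size before the download starts
object_size = s3.head_object(Bucket='bucket_name', Key='foo_bar')['ContentLength']
with click.progressbar(length=object_size, label='Downloading') as progress_bar:
    with open('local_file', mode='wb') as download_file:
        s3.download_fileobj('bucket_name', 'foo_bar', download_file, Callback=progress_bar.update)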

Andriy Ivaneyko
0

Here is the code:

import logging
import boto3
from botocore.exceptions import ClientError
import os
import sys
import threading
import math
import re
from boto3.s3.transfer import TransferConfig

ACCESS_KEY = 'XXXXXXXXXXXXXXXXX'
SECRET_KEY = 'XXXXXXXXXXXXXXXX'
REGION_NAME= 'us-east-1'
BucketName = "XXXXXXXXXXXXXXXX"
KEY = "XXXXXXXXXXXXXXXX"


class Size:
    @staticmethod
    def convert_size(size_bytes):

        if size_bytes == 0:
            return "0B"
        size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
        i = int(math.floor(math.log(size_bytes, 1024)))
        p = math.pow(1024, i)
        s = round(size_bytes / p, 2)
        return "%s %s" % (s, size_name[i])

class ProgressPercentage(object):
    def __init__(self, filename, filesize):
        self._filename = filename
        self._size = filesize
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        def convertSize(size):
            if (size == 0):
                return '0B'
            size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
            i = int(math.floor(math.log(size,1024)))
            p = math.pow(1024,i)
            s = round(size/p,2)
            return '%.2f %s' % (s,size_name[i])

        # To simplify, assume this is hooked up to a single filename
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            sys.stdout.write(
                "\r%s  %s / %s  (%.2f%%)        " % (
                    self._filename, convertSize(self._seen_so_far), convertSize(self._size),
                    percentage))
            sys.stdout.flush()

class AWSS3(object):

    """Helper class to which add functionality on top of boto3 """

    def __init__(self, bucket, aws_access_key_id, aws_secret_access_key, region_name):

        self.BucketName = bucket
        self.client = boto3.client(
            "s3",
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key,
            region_name=region_name,
        )

    def get_size_of_files(self, Key):
        response = self.client.head_object(Bucket=self.BucketName, Key=Key)
        size = response["ContentLength"]
        return {"bytes": size, "size": Size.convert_size(size)}

    def put_files(self, Response=None, Key=None):
        """
        Put the File on S3
        :return: Bool
        """
        try:

            response = self.client.put_object(
                ACL="private", Body=Response, Bucket=self.BucketName, Key=Key
            )
            return "ok"
        except Exception as e:
            print("Error : {} ".format(e))
            return "error"

    def item_exists(self, Key):
        """Given key check if the items exists on AWS S3 """
        try:
            response_new = self.client.get_object(Bucket=self.BucketName, Key=str(Key))
            return True
        except Exception as e:
            return False

    def get_item(self, Key):

        """Gets the Bytes Data from AWS S3 """

        try:
            response_new = self.client.get_object(Bucket=self.BucketName, Key=str(Key))
            return response_new["Body"].read()

        except Exception as e:
            print("Error :{}".format(e))
            return False

    def find_one_update(self, data=None, key=None):

        """
        This checks if Key is on S3 if it is return the data from s3
        else store on s3 and return it
        """

        flag = self.item_exists(Key=key)

        if flag:
            data = self.get_item(Key=key)
            return data

        else:
            self.put_files(Key=key, Response=data)
            return data

    def delete_object(self, Key):

        response = self.client.delete_object(Bucket=self.BucketName, Key=Key,)
        return response

    def get_all_keys(self, Prefix=""):

        """
        :param Prefix: Prefix string
        :return: Keys List
        """
        try:
            paginator = self.client.get_paginator("list_objects_v2")
            pages = paginator.paginate(Bucket=self.BucketName, Prefix=Prefix)

            tmp = []

            for page in pages:
                for obj in page["Contents"]:
                    tmp.append(obj["Key"])

            return tmp
        except Exception as e:
            return []

    def print_tree(self):
        keys = self.get_all_keys()
        for key in keys:
            print(key)
        return None

    def find_one_similar_key(self, searchTerm=""):
        keys = self.get_all_keys()
        return [key for key in keys if re.search(searchTerm, key)]

    def __repr__(self):
        return "AWS S3 Helper class "

    def download_file(self, file_name, object_name):

        try:
            response = self.client.download_file(
                Bucket=self.BucketName,
                Key=object_name,
                Filename=file_name,
                Config=TransferConfig(
                    max_concurrency=10,
                    use_threads=True
                ),
                Callback=ProgressPercentage(file_name,
                                            (self.client.head_object(Bucket=self.BucketName,
                                                                     Key=object_name))["ContentLength"])
            )
        except ClientError as e:
            logging.error(e)
            return False
        return True



helper = AWSS3(aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, bucket=BucketName, region_name='us-east-1')
helper.download_file(file_name='test.zip', object_name=KEY)
Soumil Nitin Shah