I wrote a Python script that determines the total size of every available AWS S3 bucket using the Boto3 list_objects() method.
The logic is simple:
1. Get the initial list of objects from each S3 bucket (automatically truncated after 1,000 objects)
2. Iterate through each object in the list, adding the size of that object to a total_size variable
3. While the bucket still has additional objects, retrieve them and repeat step 2
Here's the relevant code snippet:
import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to first 1,000 objects in bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add size of each individual item in bucket to total size
    for obj in bucket_objects['Contents']:
        total_size += obj['Size']

    # Get additional objects from bucket, if more
    while bucket_objects['IsTruncated']:
        # Get next 1,000 objects, starting after final object of current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size / 1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
          % (bucket['Name'], size_in_MB))
This code runs relatively quickly on buckets holding less than 5 MB or so of data, but when I hit a bucket with 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.
My hope was to use the threading module to parallelize the I/O portion of the code (getting the lists of objects from S3), so that the object sizes could be summed as soon as the thread retrieving them completed, rather than doing that retrieval and addition sequentially.
I understand that Python doesn't support true multithreading because of the GIL (I mention this only to head off answers to that effect), but my understanding is that since this is an I/O-bound operation rather than a CPU-intensive one, the threading module should still be able to improve the run time.
The main difference between my problem and the several threading examples I've seen on here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and then retrieve the next list of objects based on the final object's key in the current list.
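For reference, this is roughly the shape I had in mind (just a sketch, not something I've settled on): each bucket's total is computed in its own worker thread via concurrent.futures, while the page-by-page listing inside each bucket stays strictly sequential, because every list_objects() call needs the previous page's last key as its Marker. The bucket_total_size helper below is my own wrapper around the pagination loop above, not anything provided by boto3.

import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

s3_client = boto3.client('s3')

def bucket_total_size(bucket_name):
    """Sum object sizes for one bucket, paging sequentially with Marker."""
    total_size = 0
    response = s3_client.list_objects(Bucket=bucket_name)
    while True:
        for obj in response.get('Contents', []):
            total_size += obj['Size']
        if not response.get('IsTruncated'):
            break
        # The next page depends on this page's last key, so this part
        # can't be parallelized within a single bucket
        response = s3_client.list_objects(
            Bucket=bucket_name,
            Marker=response['Contents'][-1]['Key'])
    return bucket_name, total_size

bucket_names = [b['Name'] for b in s3_client.list_buckets()['Buckets']]
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(bucket_total_size, name) for name in bucket_names]
    for future in as_completed(futures):
        name, total_size = future.result()
        print('Total size of objects in bucket %s: %.2f MB'
              % (name, total_size / 1000000.0))

This only overlaps the listing of different buckets, though; the slow single bucket with many objects is still listed one page at a time.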
Can anyone explain a way to improve the run time of this code, or is it not possible in this situation?