I wrote a Python script that determines the total size of every available AWS S3 bucket using the Boto3 list_objects() method.
The logic is simple:
1. Get the initial list of objects from each S3 bucket (automatically truncated after 1,000 objects)
2. Iterate through each object in the list, adding the size of that object to a total_size variable
3. While the bucket still has additional objects, retrieve them and repeat step 2
Here's the relevant code snippet:
import boto3

s3_client = boto3.client('s3')

# Get all S3 buckets owned by the authenticated sender of the request
buckets = s3_client.list_buckets()

# For each bucket...
for bucket in buckets['Buckets']:
    # Get up to first 1,000 objects in bucket
    bucket_objects = s3_client.list_objects(Bucket=bucket['Name'])

    # Initialize total_size
    total_size = 0

    # Add size of each individual item in bucket to total size
    for obj in bucket_objects['Contents']:
        total_size += obj['Size']

    # Get additional objects from bucket, if more
    while bucket_objects['IsTruncated']:
        # Get next 1,000 objects, starting after final object of current list
        bucket_objects = s3_client.list_objects(
            Bucket=bucket['Name'],
            Marker=bucket_objects['Contents'][-1]['Key'])
        for obj in bucket_objects['Contents']:
            total_size += obj['Size']

    size_in_MB = total_size / 1000000.0
    print('Total size of objects in bucket %s: %.2f MB'
          % (bucket['Name'], size_in_MB))
This code runs relatively quickly on buckets holding less than 5 MB or so of data, but when I hit a bucket with 90+ MB of data in it, execution jumps from milliseconds to 20-30+ seconds.
My hope was to use the threading module to parallelize the I/O portion of the code (getting the lists of objects from S3), so that the object sizes could be summed as soon as the thread retrieving them completed, rather than doing that retrieval and addition sequentially.
I understand that Python doesn't support true multithreading because of the GIL (I mention this only to head off answers to that effect), but my understanding is that since this is an I/O-bound operation rather than a CPU-intensive one, the threading module should still be able to improve the run time.
The main difference between my problem and the several threading examples I've seen on here is that I'm not iterating over a known list or set. Here I must first retrieve a list of objects, check whether the list is truncated, and then retrieve the next list of objects based on the final object's key in the current list.
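For reference, this is roughly the shape I had in mind (just a sketch, not something I've settled on): each bucket's total is computed in its own worker thread via concurrent.futures, while the page-by-page listing inside each bucket stays strictly sequential, because every list_objects() call needs the previous page's last key as its Marker. The bucket_total_size helper below is my own wrapper around the pagination loop above, not anything provided by boto3.

import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

s3_client = boto3.client('s3')

def bucket_total_size(bucket_name):
    """Sum object sizes for one bucket, paging sequentially with Marker."""
    total_size = 0
    response = s3_client.list_objects(Bucket=bucket_name)
    while True:
        for obj in response.get('Contents', []):
            total_size += obj['Size']
        if not response.get('IsTruncated'):
            break
        # The next page depends on this page's last key, so this part
        # can't be parallelized within a single bucket
        response = s3_client.list_objects(
            Bucket=bucket_name,
            Marker=response['Contents'][-1]['Key'])
    return bucket_name, total_size

bucket_names = [b['Name'] for b in s3_client.list_buckets()['Buckets']]
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(bucket_total_size, name) for name in bucket_names]
    for future in as_completed(futures):
        name, total_size = future.result()
        print('Total size of objects in bucket %s: %.2f MB'
              % (name, total_size / 1000000.0))

This only overlaps the listing of different buckets, though; the slow single bucket with many objects is still listed one page at a time.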
Can anyone explain a way to improve the run time of this code, or is it not possible in this situation?