I wrote a script to count the number of objects in each S3 bucket and the total size of each bucket. The code works when I run it against a few test buckets, but times out once I include all the production buckets, some of which hold thousands of objects.
import boto3

s3 = boto3.resource('s3')
bucket_size = {}
# buckets to leave out of the totals; the trailing comma makes this a tuple,
# otherwise "in" does a substring check against a plain string
skip_list = ('some-test-bucket',)

for bu in s3.buckets.all():
    if bu.name not in skip_list:
        bucket_size[bu.name] = [0, 0]  # [object count, total size in bytes]
        print(bu.name)
        for obj in bu.objects.all():
            bucket_size[bu.name][0] += 1
            bucket_size[bu.name][1] += obj.size

print("{0:30} {1:>15} {2:>10}".format("bucket", "count", "size"))
for name, (count, size) in bucket_size.items():
    print("{0:30} {1:15} {2:10}".format(name, count, size))
It starts running, moves along, and then hangs on certain buckets with this:
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
Is there really no quick way to get this kind of metadata? My script is doing it the hard way in a sense - counting every object one by one.
So I'm asking whether there's a better script, not why it times out. When I click through some of the timed-out buckets, I noticed they contain some .gz files. I don't know why that would matter.
Of course I looked at the documentation, but I find it hard to get meaningful, actionable info out of it.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
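To show what I mean by "metadata": the closest thing I've come across is the daily storage metrics S3 publishes to CloudWatch (BucketSizeBytes / NumberOfObjects). The sketch below is only my guess at how that would look - I haven't run it against the production account, and the metric names, the StorageType dimensions, and the assumption that all buckets are in one region are mine, not something I've verified:
import datetime

import boto3

# Sketch: read the daily per-bucket storage metrics from CloudWatch
# instead of listing every object. Assumes the buckets actually have
# datapoints in the 'AWS/S3' namespace and that StandardStorage is the
# only storage class that matters for size.
cloudwatch = boto3.client('cloudwatch')
s3 = boto3.resource('s3')

now = datetime.datetime.utcnow()
start = now - datetime.timedelta(days=2)  # metrics are only published once a day


def latest_average(bucket_name, metric_name, storage_type):
    """Return the most recent daily datapoint for one bucket metric, or 0."""
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName=metric_name,
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': storage_type},
        ],
        StartTime=start,
        EndTime=now,
        Period=86400,
        Statistics=['Average'],
    )
    points = sorted(resp['Datapoints'], key=lambda p: p['Timestamp'])
    return points[-1]['Average'] if points else 0


print("{0:30} {1:>15} {2:>15}".format("bucket", "count", "size"))
for bu in s3.buckets.all():
    count = latest_average(bu.name, 'NumberOfObjects', 'AllStorageTypes')
    size = latest_average(bu.name, 'BucketSizeBytes', 'StandardStorage')
    print("{0:30} {1:15.0f} {2:15.0f}".format(bu.name, count, size))
Is this the kind of approach I should be using, or is there something better?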