
I wrote a script to count the number of objects in each S3 bucket and the total size of each bucket. The code works when I run it against a few test buckets, but it times out when I include all of the production buckets, which hold thousands of objects.

import boto3

s3 = boto3.resource('s3')

# bucket name -> [object count, total size in bytes]
bucket_size = {}

# Buckets to exclude (the trailing comma makes this a tuple rather than a plain string)
skip_list = ('some-test-bucket',)

for bu in s3.buckets.all():
    if bu.name not in skip_list:
        bucket_size[bu.name] = [0, 0]
        print(bu.name)
        # Lists every object in the bucket, one API call per 1,000 objects
        for obj in bu.objects.all():
            bucket_size[bu.name][0] += 1
            bucket_size[bu.name][1] += obj.size

print("{0:30} {1:15} {2:10}".format("bucket", "count", "size"))

for name, (count, size) in bucket_size.items():
    print("{0:30} {1:15} {2:10}".format(name, count, size))

It starts running, makes progress, and then hangs on certain buckets with this error:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:

Is there no quick way to get metadata like this? My script does it the hard way, counting every object individually.

So I'm asking whether there's a better approach, not why it times out. When I click through some of the buckets that timed out, I noticed they contain some .gz files, though I don't know why that would matter.

Of course I looked at the documentation, but I find it hard to extract meaningful, actionable information from it.

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

Chuck
    Look at the 2nd answer: https://stackoverflow.com/questions/2862617/how-can-i-tell-how-many-objects-ive-stored-in-an-s3-bucket – ketcham Jul 24 '19 at 17:52
    Instead of calling `bucket.objects.all()` I suggest using the Boto3 client interface along with a Boto3 paginator. Like the answer here: https://stackoverflow.com/questions/49482274/iterate-over-files-in-an-s3-bucket-with-folder-structure?noredirect=1 – Mark B Jul 24 '19 at 17:56
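As a rough sketch of the client-plus-paginator approach suggested in the comment above (the bucket name is just a placeholder), counting objects and summing their sizes for a single bucket might look like this; list_objects_v2 returns at most 1,000 keys per page and the paginator follows the continuation tokens:

import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

count = 0
total_size = 0
# Each page holds up to 1,000 objects; the paginator handles the continuation tokens.
for page in paginator.paginate(Bucket='some-test-bucket'):
    for obj in page.get('Contents', []):
        count += 1
        total_size += obj['Size']

print(count, total_size)

This still lists every object, so it doesn't remove the API calls, but the client interface avoids building a resource object per key and makes the pagination explicit.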

1 Answer


If you just wish to know the number of objects in a bucket and its total size, you can use metrics from Amazon CloudWatch.

From Monitoring Metrics with Amazon CloudWatch - Amazon Simple Storage Service:

BucketSizeBytes

The amount of data in bytes stored in a bucket in the STANDARD storage class, INTELLIGENT_TIERING storage class, Standard - Infrequent Access (STANDARD_IA) storage class, OneZone - Infrequent Access (ONEZONE_IA), Reduced Redundancy Storage (RRS) class, Deep Archive Storage (DEEP_ARCHIVE) class or, Glacier (GLACIER) storage class. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.

NumberOfObjects

The total number of objects stored in a bucket for all storage classes except for the GLACIER storage class. This value is calculated by counting all objects in the bucket (both current and noncurrent objects) and the total number of parts for all incomplete multipart uploads to the bucket.
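As a rough sketch, pulling those two metrics for one bucket with the Boto3 CloudWatch client could look like the following. The bucket name is a placeholder, and BucketSizeBytes is queried here only for the STANDARD storage class (StorageType 'StandardStorage'); add the other storage types if you use them:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def bucket_metrics(bucket_name):
    # CloudWatch publishes these S3 storage metrics roughly once per day,
    # so look back two days and take the most recent datapoint.
    now = datetime.utcnow()
    results = {}
    for metric_name, storage_type in (('NumberOfObjects', 'AllStorageTypes'),
                                      ('BucketSizeBytes', 'StandardStorage')):
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/S3',
            MetricName=metric_name,
            Dimensions=[
                {'Name': 'BucketName', 'Value': bucket_name},
                {'Name': 'StorageType', 'Value': storage_type},
            ],
            StartTime=now - timedelta(days=2),
            EndTime=now,
            Period=86400,
            Statistics=['Average'],
        )
        datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
        results[metric_name] = int(datapoints[-1]['Average']) if datapoints else 0
    return results

print(bucket_metrics('some-test-bucket'))

Because the metrics are only updated about once a day, the numbers can lag behind the bucket's actual contents, but retrieving them takes one CloudWatch call per bucket instead of listing every object.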

John Rotenstein