
I am writing a Python script that needs to find the most recently modified file in a bucket (under a given prefix). As far as I have read, I cannot run that query directly from Python (at least not with boto3), so I have to retrieve the information for every object in my bucket.

That means querying several thousand files, and I do not want any surprises on my bill.

If I retrieve the metadata of all the objects in my bucket and sort them locally afterwards, will I be charged for a single request, or will it count as one request per object?

Thank you all in advance

rdas
Alejo Dev

1 Answer


Popular

A common approach is to use aws s3api, which returns up to 1,000 objects per LIST request, and then filter the results with --query, for example:

aws s3api list-objects-v2 --bucket your-bucket-name --query 'Contents[?contains(LastModified, `$DATE`)]'

Keep in mind, though, that this isn't a good solution, for two reasons:

  1. It does not scale well with large buckets, and it does little to reduce outbound data, because every object's metadata is still transferred to the client.
  2. It does not reduce the number of S3 API calls, because --query is not evaluated on the server side. It is simply a client-side feature of the AWS CLI. To illustrate, here is the boto3 equivalent; as you can see, the filtering still has to happen on the client side:
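import boto3

client = boto3.client('s3', region_name='us-east-1')

# list_objects_v2 returns at most 1,000 keys per call, so paginate through every page
paginator = client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='your-bucket-name')

# Picking the most recently modified object still happens on the client side
objects = [obj for page in pages for obj in page.get('Contents', [])]
latest = max(objects, key=lambda item: item['LastModified'], default=None)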
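To make the billing side concrete, here is a rough sketch that counts how many LIST requests a listing takes while it looks for the newest object. The helper names (`iter_list_pages`, `latest_object`, `estimate_list_cost`) are made up for illustration, and the default price of 0.005 USD per 1,000 LIST requests is an assumption that varies by region, so check your own pricing.

```python
import math


def iter_list_pages(bucket, prefix=""):
    """Yield list_objects_v2 pages; each page corresponds to one billed LIST request."""
    import boto3  # imported lazily so the helpers below also work without AWS access

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    yield from paginator.paginate(Bucket=bucket, Prefix=prefix)


def latest_object(pages):
    """Return (newest object or None, number of pages consumed)."""
    newest, requests = None, 0
    for page in pages:
        requests += 1
        # An empty listing has no "Contents" key at all.
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    return newest, requests


def estimate_list_cost(object_count, usd_per_1000_requests=0.005, page_size=1000):
    """Estimate (LIST requests, cost in USD). The default price is an assumption."""
    requests = max(1, math.ceil(object_count / page_size))
    return requests, requests / 1000 * usd_per_1000_requests
```

With real credentials you would call something like `latest_object(iter_list_pages("your-bucket-name", "your/prefix/"))`, where the bucket and prefix are placeholders. Passing a prefix narrows the listing, so fewer pages (and fewer billed requests) are needed.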

Probably

One thing you could *probably* do, depending on your specific use case, is to use S3 Event Notifications to publish an event to SQS automatically whenever an object changes. You can then poll the queue for the S3 object events and their metadata, which is much more lightweight than listing the bucket. This still costs some money, and it won't help with the objects already sitting in an existing big bucket, since only new events are published. You will also have to poll actively, because SQS retains messages only for a limited time (4 days by default, 14 days at most).
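As a minimal sketch of the consuming side, the function below picks the newest created object out of a batch of S3 event messages. It assumes the notifications go straight from S3 to SQS (no SNS wrapper) and that `eventTime` uses the `2020-10-23T18:43:00.000Z` layout; `latest_from_events` is a hypothetical helper name.

```python
import json
from datetime import datetime
from urllib.parse import unquote_plus


def latest_from_events(message_bodies, prefix=""):
    """Return (key, event_time) of the newest ObjectCreated record, or None."""
    newest = None
    for body in message_bodies:
        event = json.loads(body)
        # The s3:TestEvent message sent on setup has no "Records" list.
        for record in event.get("Records", []):
            if not record.get("eventName", "").startswith("ObjectCreated"):
                continue
            # Object keys arrive URL-encoded in event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            if not key.startswith(prefix):
                continue
            # Assumed timestamp layout; adjust if your events differ.
            when = datetime.strptime(record["eventTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
            if newest is None or when > newest[1]:
                newest = (key, when)
    return newest
```

The message bodies would come from the `Body` field of each entry in the `Messages` list returned by the SQS `receive_message` call. Remember to delete messages after processing them, otherwise they reappear in the queue.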

Perfect (sorta)

This sounds to me like a good use case for S3 Inventory. It delivers a daily (or weekly) file listing your objects and their metadata, based on the fields you configure. See https://docs.aws.amazon.com/AmazonS3/latest/user-guide/configure-inventory.html
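For illustration, here is a rough sketch of finding the newest object in one inventory CSV file once it has been downloaded and decompressed. It assumes a headerless CSV whose column order comes from the `fileSchema` string in the inventory manifest, that `Key` and `LastModifiedDate` were included in the report, and that keys are URL-encoded in the way `unquote_plus` expects; `latest_from_inventory` is a hypothetical helper name.

```python
import csv
import io
from urllib.parse import unquote_plus


def latest_from_inventory(csv_text, file_schema, prefix=""):
    """Return (key, last_modified) of the newest row under prefix, or None."""
    # file_schema looks like "Bucket, Key, Size, LastModifiedDate".
    columns = [name.strip() for name in file_schema.split(",")]
    key_col = columns.index("Key")
    date_col = columns.index("LastModifiedDate")
    newest = None
    for row in csv.reader(io.StringIO(csv_text)):
        key = unquote_plus(row[key_col])  # assumption: keys are URL-encoded
        if not key.startswith(prefix):
            continue
        # ISO-8601 UTC timestamps in one fixed layout sort correctly as strings.
        if newest is None or row[date_col] > newest[1]:
            newest = (key, row[date_col])
    return newest
```

Keep in mind that the answer is only as fresh as the last inventory report, so anything uploaded since then will not show up.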

maronavenue
  • Thank you very much for taking the time to write this detailed answer. I will check out the S3 Inventory you mention. However, I still haven't found the answer to my original question: if I query all the items to sort them locally, will I be charged for every object retrieved, or for a single request? – Alejo Dev Oct 23 '20 at 18:43
  • Happy to help. Yes, you're still going to be charged, and the charge depends on the S3 API request type you use to retrieve the metadata for all objects, in this case ListObjects (LIST). – maronavenue Oct 23 '20 at 18:53
  • If I understand it correctly: I will be charged one request per 1000 objects listed, is that correct? – Alejo Dev Oct 23 '20 at 19:05
  • That's correct. You pay for that one request only. To give you an idea, retrieving 1 million objects using LIST would take 1,000 requests of 1,000 objects each, which would cost you about $5. You can try playing around in https://calculator.s3.amazonaws.com/ – maronavenue Oct 23 '20 at 19:12
  • After checking my costs, the price was 0.005 USD per 1,000 requests, so in your example, listing 1 million objects would cost 0.005 USD rather than 5 USD (1,000 LIST requests of 1,000 objects each). – Alejo Dev Oct 26 '20 at 17:09