
In an S3 bucket, I have many thousands of files whose names boil down to a prefix and a number:

A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005

New objects for a given prefix come in with varying frequency, and sometimes not at all. Older objects may disappear.

Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:

A-0003
B-0002
C-0005

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it report on the bucket itself?

I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.
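
For reference, the brute-force approach I am trying to avoid looks roughly like this (a sketch in Python with boto3; the bucket name is a placeholder). It pages through every key, which is exactly the full listing I'd like to get rid of:

    import re
    from collections import defaultdict

    import boto3

    s3 = boto3.client("s3")
    highest = defaultdict(str)

    # One LIST request per 1,000 keys -- the cost scales with the whole bucket.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket"):  # placeholder name
        for obj in page.get("Contents", []):
            match = re.fullmatch(r"(.+)-(\d+)", obj["Key"])
            if match and obj["Key"] > highest[match.group(1)]:
                # Zero-padded numbers make string order equal numeric order.
                highest[match.group(1)] = obj["Key"]

    print(sorted(highest.values()))  # e.g. ['A-0003', 'B-0002', 'C-0005']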

Thijs van Dien
  • Would something like this help? https://stackoverflow.com/questions/45429556/how-list-amazon-s3-bucket-contents-by-modified-date – jarnohenneman Dec 14 '17 at 02:31
  • @jarnohenneman I'm afraid not. The latest modification date per prefix would be fine too, but I can't query for any specific date (range), because for each prefix I want the highest number overall, and the rates of increment differ. – Thijs van Dien Dec 14 '17 at 02:36
  • There is no way to do this without listing the entire bucket. – helloV Dec 14 '17 at 02:53

3 Answers

Answer 1 (score 1)

I think this is what you are looking for: Athena exposes each row's source object as the "$path" pseudo-column, and you can apply a regular expression to it to match the key pattern you are querying, e.g.:

WHERE regexp_extract(sp."$path", '[^/]+$') LIKE concat('%', cast(current_date - interval '1' day AS varchar), '.csv')
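
A sketch of running such a query end to end with boto3, assuming a pre-existing Athena table over the bucket (here hypothetically named sp) and placeholder database and output names:

    import boto3

    athena = boto3.client("athena")

    # "$path" holds the S3 URI of the object each row came from;
    # regexp_extract('[^/]+$') keeps only the key name after the last slash.
    query = """
    SELECT DISTINCT regexp_extract(sp."$path", '[^/]+$') AS key_name
    FROM sp
    WHERE regexp_extract(sp."$path", '[^/]+$')
          LIKE concat('%', cast(current_date - interval '1' day AS varchar), '.csv')
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "default"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )

Note that Athena still scans the underlying objects to produce the rows, so this filters on key names but does not avoid reading the bucket.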

Answer 2 (score 0)

The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it?

Yes; at the moment there is no direct way of doing it with AWS S3 alone. Even with Athena, the files will be scanned to query their content, but it will be easier thanks to Athena's standard SQL support, and faster, since the queries run in parallel.

So far I have only found it capable of searching within objects, but all I care about are their key names.

Both Athena and S3 Select query by content, not by key names.

The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including the key names, for faster querying.
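
A sketch of what that could look like, assuming a hypothetical table named file-index with the prefix as partition key and the zero-padded number as sort key; the highest number per prefix is then a single cheap Query:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("file-index")  # hypothetical table

    def record_object(key):
        # Call on every object creation (e.g. from an S3 event notification).
        prefix, number = key.rsplit("-", 1)
        table.put_item(Item={"prefix": prefix, "number": number, "key": key})

    def highest_key(prefix):
        # Sort keys are ordered, so one descending read returns the maximum.
        resp = table.query(
            KeyConditionExpression=Key("prefix").eq(prefix),
            ScanIndexForward=False,  # descending by sort key
            Limit=1,
        )
        return resp["Items"][0]["key"] if resp["Items"] else None

Deleted objects would need a matching delete_item call to keep the index honest.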

Ashan
  • What about a Lambda function that triggers on new file creation and copies the file to another S3 bucket? When a new version comes in, you remove the old one and copy the new one. – jarnohenneman Dec 14 '17 at 07:00
  • I didn't get your point. Can you explain further how moving files to another bucket relates to querying? – Ashan Dec 14 '17 at 07:09
  • If you have a separate bucket with the last 3 files of each prefix, you can just query that bucket and have an up-to-date list, plus the latest version of the files to do other things with as a bonus, rather than using DynamoDB :) – jarnohenneman Dec 14 '17 at 07:24
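
A minimal sketch of the mirroring approach suggested in the comments above, assuming a hypothetical mirror bucket named latest-per-prefix that should only ever hold the newest key of each prefix:

    import boto3

    s3 = boto3.client("s3")
    MIRROR = "latest-per-prefix"  # hypothetical mirror bucket

    def handler(event, context):
        # Triggered by s3:ObjectCreated:* on the source bucket.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            prefix = key.rsplit("-", 1)[0]
            listing = s3.list_objects_v2(Bucket=MIRROR, Prefix=prefix + "-")
            existing = [o["Key"] for o in listing.get("Contents", [])]
            # Zero-padding makes string comparison match numeric comparison.
            if all(key > e for e in existing):
                for e in existing:
                    s3.delete_object(Bucket=MIRROR, Key=e)
                s3.copy_object(Bucket=MIRROR, Key=key,
                               CopySource={"Bucket": bucket, "Key": key})

Listing the mirror bucket is then enough to answer the original question, since it holds only one object per prefix.
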
Answer 3 (score 0)

Use S3 Inventory to create a list of objects and Athena to query it:

https://aws.amazon.com/blogs/storage/manage-and-analyze-your-data-at-scale-using-amazon-s3-inventory-and-amazon-athena/
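
Once the inventory table exists (the linked post walks through creating it; the table, database, and output names below are placeholders), the per-prefix maximum is a single GROUP BY. A sketch with boto3:

    import boto3

    athena = boto3.client("athena")

    # Zero-padded numbers mean max(key) in string order is the numeric maximum.
    query = """
    SELECT regexp_extract(key, '^(.+)-', 1) AS prefix,
           max(key) AS latest_key
    FROM my_bucket_inventory
    GROUP BY 1
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "default"},  # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
    )

Keep in mind that S3 Inventory reports are delivered daily or weekly, so the result can lag behind the live bucket by up to that interval.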

Craig Younkins