
I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.

I read this, which helped me to some extent.

Is there a way to find out which objects are being accessed and which are not?

Subbu
  • The question doesn't have enough information to help answer it as it stands. What exactly do you mean by the objects which are not in use? – mohit Oct 24 '18 at 08:22
  • Oh sorry. I meant that the objects which are not used by any other AWS services – Subbu Oct 24 '18 at 09:08
  • You should know how your own objects are being used. However, you can enable S3 access logging and check which of your objects are accessed. – Reza Mousavi Oct 24 '18 at 09:32

3 Answers


There is no native way of doing this at the moment, so all of the options are workarounds that depend on your use case.

You have a few options:

  1. Tag each S3 object with its last-accessed date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired by each Get event. Then create a function that runs on a scheduled CloudWatch Event and deletes all objects with a date tag prior to today.
  2. Query the CloudTrail logs: write a custom function that extracts last-access times from the object-level CloudTrail logs. This could be done with Athena, or by reading the logs directly from S3.
  3. Maintain a separate index, in something like DynamoDB, which your application updates on every read.
  4. Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
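As a rough sketch of the log-analysis idea in options 1 and 2: once logging is enabled, the "last access" question reduces to parsing log lines for GET/PUT operations per key and comparing against a cutoff. The snippet below assumes the S3 server access log line layout (bucket owner, bucket, bracketed timestamp, remote IP, requester, request ID, operation, key, ...); the regex is a simplified assumption about that format, and `last_access_times` / `stale_keys` are hypothetical helper names, not an AWS API:

```python
from datetime import datetime

import re

# Simplified match for an S3 server access log line: the bracketed
# timestamp, then remote IP / requester / request ID, then the
# operation (e.g. REST.GET.OBJECT) and the object key.
LINE_RE = re.compile(
    r"\[(?P<time>[^\]]+)\] \S+ \S+ \S+ "
    r"(?P<op>REST\.\w+\.\w+) (?P<key>\S+)"
)

# Operations that count as the object being "accessed" (read and write).
ACCESS_OPS = ("REST.GET.OBJECT", "REST.PUT.OBJECT")

def last_access_times(log_lines):
    """Return {object_key: most recent GET/PUT timestamp} from log lines."""
    latest = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m or m.group("op") not in ACCESS_OPS:
            continue
        t = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        key = m.group("key")
        if key not in latest or t > latest[key]:
            latest[key] = t
    return latest

def stale_keys(all_keys, latest, cutoff):
    """Keys never seen in the logs, or last accessed before `cutoff`."""
    return sorted(k for k in all_keys
                  if k not in latest or latest[k] < cutoff)
```

In a real setup, `all_keys` would come from a bucket listing or an S3 inventory report, and the log lines from the logging target bucket.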
Matt D

No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.

For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.

If you didn't create the files and you aren't knowingly using them, then you can probably delete them. But you are the only person who would know whether they are necessary.

John Rotenstein
  • Yeah, thanks for your response. Here in my new company, I'm in a situation where I don't know which S3 buckets are used or why they were created. I'm now going through all the projects that have already been developed, which is why I wanted to know which S3 objects are currently in use. Am I thinking in the right direction? Is there any way to find out via scripting? I found that Amazon Athena might help me solve this, but I have no idea about it. – Subbu Oct 24 '18 at 09:51
  • The cost of storage is often cheaper than the time you would spend figuring out whether data is important or being used. It's just like an online photo collection — it's cheaper to store everything than spend the time to sort out unwanted photos. So, you can certainly do a better job going forward, but it might not be worth worrying about the old stuff. Just make sure it isn't publicly accessible. – John Rotenstein Oct 24 '18 at 10:51
  • :) Yeah, I'm thinking along those lines too. I was just curious whether there is any solution for that. Thank you. – Subbu Oct 24 '18 at 10:55
  • @Subbu there is a problem with the fact that you are referring to objects in S3 being "in use." That is not really the right question to ask, because it doesn't have a meaningful answer. The videos and photos of my children when they were younger are not "in use" but I absolutely do not want them deleted... and there is no technical means by which anyone could learn that, other than by asking me. You need to understand why the objects were created, by whom, and for what purpose. Left alone, S3 bills *only* for actual storage space consumed by objects -- there is nothing "unused." – Michael - sqlbot Oct 24 '18 at 12:23
  • Thank you for all your responses. I'm glad that I had very good discussion with you. – Subbu Oct 24 '18 at 12:30

There is a recent AWS blog post that I found very interesting, describing a cost-optimized approach to this problem.

Here is the description from the AWS blog:

  1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.

  2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.

  3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.

  4. The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that must be expired, using the following logic:

  • Capture the number of days (x) configuration from the S3 Lifecycle configuration.
  • Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
  • Write a manifest file with the list of these objects to an S3 bucket.
  • Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
  5. The Lifecycle rule on the source S3 bucket expires all objects that were created more than 'x' days ago and that carry the "delete=True" tag applied by the S3 Batch Operations job.
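The delta-list step above can be sketched without AWS at all: given the inventory report (key → creation time) and the set of keys seen in the access logs for the window, select objects created earlier than 'x' days ago that were never accessed, and render them as an S3 Batch Operations CSV manifest (one `bucket,key` row per object). This is a minimal stand-in for the Athena query; `delta_list` and `write_manifest` are hypothetical names:

```python
import csv
import io
from datetime import timedelta

def delta_list(inventory, accessed_keys, now, days):
    """Objects created more than `days` ago (per the inventory report)
    that do not appear in the access logs for that window."""
    cutoff = now - timedelta(days=days)
    return sorted(key for key, created in inventory.items()
                  if created < cutoff and key not in accessed_keys)

def write_manifest(bucket, keys):
    """Render an S3 Batch Operations CSV manifest: one bucket,key row
    per object to be tagged."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for key in keys:
        writer.writerow([bucket, key])
    return buf.getvalue()
```

The resulting manifest would be uploaded to S3 and referenced when creating the Batch Operations tagging job; the Lifecycle rule then does the actual expiry.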
Aniket Kulkarni