33

I need to fetch a list of items from S3 using Boto3, but instead of the default sort order I want them returned sorted by last-modified date, newest first.

I know you can do it via awscli:

aws s3api list-objects --bucket mybucketfoo --query "reverse(sort_by(Contents,&LastModified))"

and it's doable via the UI console (not sure if this is done client side or server side).

I can't see how to do this in Boto3.

I am currently fetching all the files, and then sorting...but that seems overkill, especially if I only care about the 10 or so most recent files.

The filter system seems to only accept the Prefix for s3, nothing else.

Nate
  • You can get all objects, get their last modified date and sort them based on the date. Check out this [question](https://stackoverflow.com/questions/9679344/how-can-i-get-last-modified-datetime-of-s3-objects-with-boto) – cookiedough Jun 15 '17 at 18:36
  • The S3 api does not support listing in this way. The CLI (and probably the console) will fetch everything and then perform the sort. – Jordon Phillips Jun 15 '17 at 19:37
  • You're getting the data back into Python, so simply sort the returned data. There's no need to ask boto3 to do it for you -- it's just one extra line of Python. – John Rotenstein Jun 15 '17 at 22:25
  • 7
    @JohnRotenstein the issue is complexity. why get N records, and then sort N records to get the set Z that you want, when you can ask AWS to only return Z set initially? same reason i wouldn't want to do `select * from table` . and then loop through and find "where X = 1". – Nate Jun 22 '17 at 14:51
  • You can use `subprocess` module to run the aws cli api that supports sort by date. – Vaulstein Feb 08 '18 at 07:07
  • I feel like none of the answers given here address the OP's question: is there any way to sort (or filter) by last modified date _at S3 side_? I neither want to wait the time nor pay the cost for millions of irrelevant files that are too old, just to find the recent ones. I assume this is not possible. Is it? – Mike Williamson Nov 22 '22 at 21:59

10 Answers

32

If there are not many objects in the bucket, you can use Python to sort it to your needs.

Define a lambda to get the last modified time:

# strftime('%s') is a platform-specific extension, so use timestamp() instead
get_last_modified = lambda obj: int(obj['LastModified'].timestamp())

Get all objects and sort them by last modified time.

import boto3

s3 = boto3.client('s3')
# A single list_objects_v2 call returns at most 1,000 objects.
objs = s3.list_objects_v2(Bucket='my_bucket')['Contents']
[obj['Key'] for obj in sorted(objs, key=get_last_modified)]

If you want to reverse the sort:

[obj['Key'] for obj in sorted(objs, key=get_last_modified, reverse=True)]
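
If the bucket holds more than 1,000 objects, one way around the single-call limit is to walk every page with a paginator and keep only the newest few entries as you go. The following is a minimal sketch, not a definitive implementation: `newest_keys` and `list_pages` are hypothetical helper names, and the sample pages are made-up data standing in for a real listing. Every object is still listed, because S3 has no server-side sort by date.

```python
import heapq
from datetime import datetime, timezone


def newest_keys(pages, n=10):
    """Return the keys of the n most recently modified objects, newest first."""
    # An empty listing page has no 'Contents' entry, hence .get().
    objects = (obj for page in pages for obj in page.get('Contents', []))
    newest = heapq.nlargest(n, objects, key=lambda obj: obj['LastModified'])
    return [obj['Key'] for obj in newest]


def list_pages(bucket):
    """Return an iterator over every list_objects_v2 page (needs AWS credentials)."""
    import boto3  # imported here so newest_keys can be tried without AWS access
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    return paginator.paginate(Bucket=bucket)


# Made-up pages standing in for list_pages('my_bucket').
sample_pages = [
    {'Contents': [
        {'Key': 'a.txt', 'LastModified': datetime(2017, 6, 1, tzinfo=timezone.utc)},
        {'Key': 'b.txt', 'LastModified': datetime(2017, 6, 3, tzinfo=timezone.utc)},
    ]},
    {'Contents': [
        {'Key': 'c.txt', 'LastModified': datetime(2017, 6, 2, tzinfo=timezone.utc)},
    ]},
]
print(newest_keys(sample_pages, n=2))
```

Against a real bucket the call would be `newest_keys(list_pages('my_bucket'), n=10)`. `heapq.nlargest` only keeps `n` candidates in memory at a time, which avoids building and sorting the full list.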
helloV
  • I did a variation of this...though not what i think is optimal: `get_last_modified = lambda obj: int(obj.last_modified.strftime('%s'))` `files = [obj.key for obj in sorted(unsorted, key=get_last_modified, reverse=True)][0:9]` – Nate Jun 23 '17 at 13:25
  • 18
    list_objects_v2 returns 1000 objects max, if your bucket contains more than 1000 the above won't work – Tomer Mar 26 '18 at 20:41
  • 7
    @Tomer that's why I put the disclaimer `If there are not many objects in the bucket` – helloV Mar 26 '18 at 21:25
  • 1
    Is it needed to cast the 'LastModified' to string and then to int? This seems to work as well: `get_last_modified = lambda obj: obj['LastModified']` – Popieluch Apr 19 '18 at 12:45
  • @Popieluch if you don't cast it to int, then sort will be a string sort instead of int sort. If you do not plan to sort, then cast is not needed. – helloV Apr 19 '18 at 14:00
  • 1
    @helloV but is there a reason to format the date as string in the first place? Comparing datetime objects directly seems to work. – Popieluch Apr 19 '18 at 14:56
  • One question here. I am not clear about the overall steps to run this piece of code. Do you put this lambda and boto3 function in AWS lambda service's function code area or put it in a Python script that run in EC2? If you put it in AWS lambda, do you need to assign a specific AWS role so it can access S3 from Lambda? Thanks. – user1457659 Oct 23 '18 at 21:21
  • @user1457659 this has nothing to do with AWS Lambda service. I am using Python lambda function. Should work on any machine that has Python installed and AWS credentials set correctly. – helloV Oct 23 '18 at 21:26
  • 4
    Apparently using `%s` is frowned upon. You can use `.timestamp()` instead: https://stackoverflow.com/questions/11743019/convert-python-datetime-to-epoch-with-strftime – compguy24 Aug 01 '19 at 14:23
  • This worked for me. Thank you – arn-arn Jan 21 '22 at 21:34
11

Slight improvement of above:

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('myBucket')
files = my_bucket.objects.filter()
files = [obj.key for obj in sorted(files, key=lambda x: x.last_modified, 
    reverse=True)]
mellifluous
zalmane
7

I did a small variation of what @helloV posted. It's not 100% optimal, but it gets the job done within the limitations boto3 has at this time.

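import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('myBucket')

# ObjectSummary items expose last_modified as an attribute, not a dict key
get_last_modified = lambda obj: obj.last_modified

# objects.filter() with no arguments lists every object in the bucket
unsorted = list(my_bucket.objects.filter())

# [0:9] keeps the first nine entries, i.e. the nine most recent keys
files = [obj.key for obj in sorted(unsorted, key=get_last_modified,
    reverse=True)][0:9]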
Nate
  • what does [0:9] do? – Vikrant Goel Apr 13 '20 at 05:36
  • @VikrantGoel filters it from 0 to 9, so gets a subset of the array – Nate Apr 17 '20 at 12:35
  • 3
    WARNING: Although you want to get the last X objects, in this solution you will still do GET on ALL the objects in the bucket and it may result in SIGNIFICANT cost (especially if you run this every time). – MikeL Nov 25 '20 at 07:24
6

It seems there is no way to do the sort using boto3 itself. According to the documentation, boto3 only supports these methods for Collections:

all(), filter(**kwargs), page_size(**kwargs), limit(**kwargs)

Hope this helps in some way. https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.ServiceResource.buckets

4

A simpler approach, using the python3 sorted() function:

import boto3
s3 = boto3.resource('s3')

myBucket = s3.Bucket('name')

def obj_last_modified(myobj):
    return myobj.last_modified

sortedObjects = sorted(myBucket.objects.all(), key=obj_last_modified, reverse=True)

You now have a reverse-sorted list, ordered by the 'last_modified' attribute of each Object.

weegolo
3

To get the last modified files in a folder in S3:

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucket_name')
files = my_bucket.objects.filter(Prefix='folder_name/subfolder_name/')
files = [obj.key for obj in sorted(files, key=lambda x: x.last_modified,
    reverse=True)]

print(files)

To get the two files which are last modified:

files = files[0:2]
mellifluous
3

Today it is possible to search the bucket listing using JMESPath, the same way we can in the AWS CLI (example).

import boto3
s3 = boto3.client("s3")

s3_paginator = s3.get_paginator('list_objects_v2')
s3_iterator = s3_paginator.paginate(Bucket='your-bucket-name')

filtered_iterator = s3_iterator.search(
    "Contents[?starts_with(Key, 'folder6/')]"
    " | reverse(sort_by(@, &to_string(LastModified)))"
    " | @[].Key"
    " | [:2]"
)

for key_data in filtered_iterator:
    print(key_data)

JMESPath explanation

  1. Contents[?starts_with(Key, 'folder6/')]: optional, selects objects inside a particular folder.
  2. reverse(sort_by(@, &to_string(LastModified))): sorts the objects by the "LastModified" date value, in descending order.
  3. @[].Key: gets the object keys.
  4. [:2]: gets the first 2.

For example, if the bucket data looks like this:

{
  "Contents": [
    {"Key": "folder6/file-64.pdf", "LastModified": "2014-11-21T19:04:05.000Z", "ETag": "\"70ee1738b6b21e2c8a43f3a5ab0eee64\"", "Size": 187932, "StorageClass": "STANDARD"},
    {"Key": "folder5/file-63.pdf", "LastModified": "2014-11-21T19:03:05.000Z", "ETag": "\"70ee1738b6b21e2c8a43f3a5ab0eee63\"", "Size": 227543, "StorageClass": "STANDARD"},
    {"Key": "folder6/file-62.pdf", "LastModified": "2014-11-21T19:02:05.000Z", "ETag": "\"70ee1738b6b21e2c8a43f3a5ab0eee62\"", "Size": 173484, "StorageClass": "STANDARD"},
    {"Key": "folder6/file-61.pdf", "LastModified": "2014-11-21T19:01:05.000Z", "ETag": "\"70ee1738b6b21e2c8a43f3a5ab0eee61\"", "Size": 192940, "StorageClass": "STANDARD"}
  ]
}

It will yield this result:

[
  "folder6/file-64.pdf",
  "folder6/file-62.pdf"
]
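
One caveat worth knowing: the paginator's `search()` evaluates the expression on each page of results separately, so with more than one page the sort and the `[:2]` slice are applied per page rather than across the whole bucket. Below is a plain-Python sketch of the same pipeline applied across all pages. `newest_keys_under` is a hypothetical helper name, and the sample page reuses the example data above, where `LastModified` is a string (boto3 itself returns `datetime` objects, which `str()` also handles here).

```python
def newest_keys_under(pages, prefix, n=2):
    """Plain-Python counterpart of the JMESPath expression, applied across all pages."""
    matching = [
        obj
        for page in pages
        for obj in page.get("Contents", [])
        if obj["Key"].startswith(prefix)  # Contents[?starts_with(Key, 'folder6/')]
    ]
    # reverse(sort_by(@, &to_string(LastModified)))
    matching.sort(key=lambda obj: str(obj["LastModified"]), reverse=True)
    # @[].Key | [:2]
    return [obj["Key"] for obj in matching][:n]


# One page holding the example data shown above.
sample_pages = [
    {"Contents": [
        {"Key": "folder6/file-64.pdf", "LastModified": "2014-11-21T19:04:05.000Z"},
        {"Key": "folder5/file-63.pdf", "LastModified": "2014-11-21T19:03:05.000Z"},
        {"Key": "folder6/file-62.pdf", "LastModified": "2014-11-21T19:02:05.000Z"},
        {"Key": "folder6/file-61.pdf", "LastModified": "2014-11-21T19:01:05.000Z"},
    ]},
]
print(newest_keys_under(sample_pages, "folder6/"))
```

With a real bucket, `pages` would be the `s3_iterator` from the snippet above. Passing `Prefix='folder6/'` to `paginate()` would also let S3 do the prefix filtering before any data is returned.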
ArKan
2

import boto3

s3 = boto3.client('s3')

get_last_modified = lambda obj: int(obj['LastModified'].strftime('%Y%m%d%H%M%S'))

def sortFindLatest(bucket_name):
    resp = s3.list_objects(Bucket=bucket_name)
    if 'Contents' in resp:
        objs = resp['Contents']
        files = sorted(objs, key=get_last_modified)
        for key in files:
            file = key['Key']
            cx = s3.get_object(Bucket=bucket_name, Key=file)

This works for me to sort by date and time. I am using Python3 AWS lambda. Your mileage may vary. It can be optimized, I purposely made it discrete. As mentioned in an earlier post, 'reverse=True' can be added to change the sort order.

Nelson
0
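import boto3

s3 = boto3.client('s3')
keys = []

kwargs = {'Bucket': 'my_bucket'}
while True:
    resp = s3.list_objects_v2(**kwargs)
    # 'Contents' is missing when a page has no objects
    for obj in resp.get('Contents', []):
        keys.append(obj['Key'])

    # keep requesting pages until no continuation token is returned
    try:
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
    except KeyError:
        break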

This will get you all the keys, in the order S3 lists them: sorted by key name, not by last-modified date.
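
To order by date with this kind of loop, one option is to collect the whole object records rather than just the keys, then sort once at the end. This is a rough sketch under stated assumptions: `list_all_objects` is a hypothetical helper, and `FakeS3Client` is a made-up stand-in for `boto3.client('s3')` so the loop can be tried without AWS access.

```python
from datetime import datetime, timezone


def list_all_objects(client, bucket):
    """Collect every object record from a bucket, following continuation tokens."""
    kwargs = {'Bucket': bucket}
    objects = []
    while True:
        resp = client.list_objects_v2(**kwargs)
        objects.extend(resp.get('Contents', []))  # no 'Contents' key on an empty page
        if 'NextContinuationToken' not in resp:
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
    return objects


class FakeS3Client:
    """Hypothetical stand-in for boto3.client('s3') that serves two pages."""

    def list_objects_v2(self, Bucket, ContinuationToken=None):
        if ContinuationToken is None:
            return {
                'Contents': [{'Key': 'old.log',
                              'LastModified': datetime(2017, 6, 1, tzinfo=timezone.utc)}],
                'NextContinuationToken': 'page-2',
            }
        return {
            'Contents': [{'Key': 'new.log',
                          'LastModified': datetime(2017, 6, 15, tzinfo=timezone.utc)}],
        }


objects = list_all_objects(FakeS3Client(), 'my_bucket')
newest_first = sorted(objects, key=lambda obj: obj['LastModified'], reverse=True)
print([obj['Key'] for obj in newest_first])
```

Swapping `FakeS3Client()` for `boto3.client('s3')` should give the same behaviour against a real bucket, at the cost of listing every object.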

0

So my answer can be used for last modified, but I thought that if you've come to this page, there is a chance that you'd like to be able to sort your files in some other manner. So, to kill two birds with one stone:

In this thread you can find the built-in method sorted. If you read the docs or this article, you will see that you can pass your own key function to control how objects are sorted. For example, in my case I had a bunch of files whose names started with a number, potentially followed by a letter. It looked like this:

1.svg
10.svg
100a.svg
11.svg
110.svg
...
2.svg
20b.svg
200.svg
...
10011b.svg
...
etc

I wanted it to be sorted by the number up front - I didn't care about the letter behind the number, so I wrote this function:

def my_sort(x):
    try:
        # this will take the file name, split over the file type and take just the name, cast it to an int, and return it
        return int(x.split(".")[0])
    # if it couldn't do that
    except ValueError:
        # it will take the file name, split it over the extension, and take the name
        n = x.split(".")[0]
        s = ""
        # then for each character
        for e in n:
            # check to see if it is a digit and append it to a string if it is
            if e.isdigit():
                s += e
            # if it's not a digit, it hit the letter at the end of the number, so return it
            else:
                return int(s)

Which means now I can do this:

import boto3
s3r = boto3.resource('s3')
bucket = s3r.Bucket('my_bucket')
objs = bucket.objects.filter(Prefix="my_prefix/")
names = [o.key.split("/")[-1] for o in objs]
names = sorted(names, key=my_sort)

# do whatever with the sorted data

which will sort my files by the numerical prefix in their name.
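
For what it's worth, the same kind of key function can be written more compactly with a regular expression. This is only a sketch of an alternative, not the function above: `leading_number` is a hypothetical name, and returning -1 for names without a leading number (so they sort first) is my own assumption about what you would want.

```python
import re


def leading_number(name):
    """Sort key: the integer at the start of a file name, or -1 if there is none."""
    match = re.match(r'\d+', name)
    return int(match.group()) if match else -1


names = ['1.svg', '10.svg', '100a.svg', '11.svg', '2.svg', '20b.svg', 'readme.svg']
print(sorted(names, key=leading_number))
```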

Shmack