4
def get_latest_file_movement(**kwargs):
    get_last_modified = lambda obj: int(obj['LastModified'].strftime('%s'))
    s3 = boto3.client('s3')
    objs = s3.list_objects_v2(Bucket='my-bucket',Prefix='prefix')['Contents']
    last_added = [obj['Key'] for obj in sorted(objs, key=get_last_modified, reverse=True)][0]
    return last_added

The code above gets me the latest file; however, I only want files ending with '.csv'.

Elad Kalif
facepalmdev7

2 Answers

5

Filter by suffix

If the S3 object's key is a filename, the suffix for your objects is a filename-extension (like .csv).

So filter the objects by key ending with .csv.

Use the built-in filter(predicate, iterable) with a lambda predicate that tests str.endswith(suffix):

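import boto3

def get_latest_file_movement(**kwargs):
    s3 = boto3.client('s3')
    objs = s3.list_objects_v2(Bucket='my-bucket', Prefix='prefix')['Contents']

    # csv only; list() is needed because filter() returns an iterator in Python 3
    csvs = list(filter(lambda obj: obj['Key'].endswith('.csv'), objs))
    # last first: sort by modified-timestamp descending
    csvs.sort(key=lambda obj: obj['LastModified'], reverse=True)

    return csvs[0]  # the whole object dict; use csvs[0]['Key'] for just the key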

Note: To get the last-modified only

This solution reverses the sort direction with reverse=True (descending), so the first element is the most recently modified one. You can also sort in the default (ascending) order and pick the last element with [-1], as answered by Kache in your preceding question.
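To compare both sort directions without touching S3, here is a minimal sketch on made-up data. The objs list is hypothetical sample data shaped like the entries of Contents, and the helper by_modified is a name introduced only for this illustration. It also shows max() as a sort-free alternative, which is an addition of mine rather than part of the answer above.

```python
from datetime import datetime, timezone

# Hypothetical sample data shaped like the entries of the 'Contents' list
objs = [
    {'Key': 'prefix/a.csv', 'LastModified': datetime(2022, 1, 1, tzinfo=timezone.utc)},
    {'Key': 'prefix/b.csv', 'LastModified': datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {'Key': 'prefix/c.txt', 'LastModified': datetime(2022, 4, 1, tzinfo=timezone.utc)},
    {'Key': 'prefix/d.csv', 'LastModified': datetime(2022, 2, 1, tzinfo=timezone.utc)},
]

# Keep only the csv objects
csvs = [obj for obj in objs if obj['Key'].endswith('.csv')]


def by_modified(obj):
    """Sort key: the object's last-modified datetime."""
    return obj['LastModified']


newest_desc = sorted(csvs, key=by_modified, reverse=True)[0]  # descending, take the first
newest_asc = sorted(csvs, key=by_modified)[-1]                # ascending, take the last
newest_max = max(csvs, key=by_modified, default=None)         # no sort; None if there are no csv objects
```

All three variables refer to the same object here. The max() variant avoids sorting the whole list and, with default=None, does not raise when no csv object exists, whereas [0] and [-1] raise an IndexError on an empty list.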

Simplification

From the boto3 list_objects_v2 docs about the response structure:

Contents (list) ... LastModified (datetime) -- Creation date of the object.

Boto3 returns a datetime object for LastModified. See also Getting S3 objects' last modified datetimes with boto.

So why do we need the additional steps of formatting it as a string and then converting it to an int: int(obj['LastModified'].strftime('%s'))?

Python can also sort the datetime directly.
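A small sketch of that point, using two made-up timestamps (the variable names are only for illustration). As far as I know, '%s' is not among the format codes documented for Python's strftime and depends on the platform's C library, so if an integer is really needed, datetime.timestamp() is the more portable route.

```python
from datetime import datetime, timezone

# Two hypothetical timezone-aware timestamps, like the ones boto3 returns
earlier = datetime(2022, 4, 29, 15, 56, tzinfo=timezone.utc)
later = datetime(2022, 5, 2, 11, 25, tzinfo=timezone.utc)

# datetime objects compare directly, so they can serve as sort keys as they are
newest = max([earlier, later])
ordered = sorted([later, earlier])

# Portable epoch seconds, in case an integer is really required
epoch_seconds = int(later.timestamp())
```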

Limitation warning

S3's API operation and its corresponding Boto3 method list_objects_v2 limit the result set to one thousand objects:

Returns some or all (up to 1,000) of the objects in a bucket with each request.

So, for buckets with many objects sharing the same prefix, the result can be silently truncated to the first 1,000 objects, even after applying the prefix filter.
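One way around the limit is a boto3 paginator, which requests page after page until all objects under the prefix have been listed. The following is a sketch under assumptions, not a tested production solution: the function name latest_csv_key and its parameters are hypothetical, and the client is passed in so the function does not create one itself.

```python
def latest_csv_key(s3_client, bucket, prefix):
    """Return the key of the most recently modified .csv object, or None.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client('s3').
    """
    paginator = s3_client.get_paginator('list_objects_v2')
    latest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is missing from a page when nothing matches the prefix
        for obj in page.get('Contents', []):
            if not obj['Key'].endswith('.csv'):
                continue  # skip everything that is not a csv
            if latest is None or obj['LastModified'] > latest['LastModified']:
                latest = obj  # remember the newest csv seen so far
    return latest['Key'] if latest else None
```

It could be called as latest_csv_key(boto3.client('s3'), 'my-bucket', 'prefix'). Tracking only the newest object while iterating avoids holding every listed object in memory and also handles the case where no csv exists.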

hc_dev
  • 1
    I like this answer, but you have obj and the lambda function swapped in the filter function. Filter function requires the first parameter to be the function that returns True/False and the second parameter to be the collection. – Danny Apr 29 '22 at 15:56
  • @Danny, thanks for spotting this. You always have to pay attention when using built-ins `filter` and `sorted` (the order of parameters is different). That's why I prefer [`list.sort()`](https://docs.python.org/3/howto/sorting.html) among others (modify in place, readability, etc.). – hc_dev May 02 '22 at 11:25
  • 1
    Does not scale past 1000 objects under `prefix`. – Illya Moskvin Jun 09 '23 at 18:46
  • @IllyaMoskvin, thanks for the scalability hint. I added this as "Limitation warning". – hc_dev Jun 19 '23 at 18:46
0

You can check if they end with .csv:

def get_latest_file_movement(**kwargs):
    get_last_modified = lambda obj: int(obj['LastModified'].strftime('%s'))
    s3 = boto3.client('s3')
    objs = s3.list_objects_v2(Bucket='my-bucket',Prefix='prefix')['Contents']

    last_added = [obj['Key'] for obj in sorted(objs, key=get_last_modified, reverse=True) if obj['Key'].endswith('.csv')][0]

    return last_added
Marcin