
I've got an S3 bucket being updated in real time with API data. The files are saved with a .XXX extension, where XXX is 1...n.

My R script needs to be able to grab the latest files and add them to the analysis data frame. I've been using the aws.s3 package so far. After setting the secret/access keys in the environment:

mybucket <- get_bucket("mybucket1")

This returns an s3 object of 1000 elements (presumably more), and it looks like each object has a Contents: List of 7, one of which is $LastModified. How do I get the name of the last modified file?

Mybucket     Large s3_bucket (1000 elements, 2.1Mb)
contents: List of 7
 ..$ Key         : chr "folder1"
 ..$ LastModified: chr "2018-01-16T09:58:47.000Z"
 ..$ ETag        : chr "\" nnnnnnnnnnn\""
 etc. ($Owner, $StorageClass, $Bucket, - attr)
contents: List of 7
 ..$ Key         : chr "folder1/file.1"
 ..$ LastModified: chr "2018....etc"
 ..$ ETag        : chr "...etc..."
 etc.
contents: List of 7
etc.

It's really the number after 'file.' that I need (in this case it would be 1).

After experimentation, I think a CLI command through RCurl would be a better option.

aws s3 ls s3://mybucket --recursive | grep APIdata@symbol=XXX&interval=5.1*

This gets me really close, but the command is dropping the '&interval=5.1*' part, so it returns ALL objects matching 'APIdata@symbol=XXX*'.
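The unquoted & is a shell control operator, so everything after it never reaches grep; quoting the pattern keeps it intact. A minimal, untested sketch through R's system(), reusing the bucket and pattern from above:

# quoting the grep pattern stops the shell from treating '&' as a control operator
files <- system(
  "aws s3 ls s3://mybucket --recursive | grep 'APIdata@symbol=XXX&interval=5.1'",
  intern = TRUE
)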

Garglesoap
  • Please add some example data to your question (e.g. the first n elements) to make it easier for us to answer. Thx :-) – R Yoda Jan 17 '18 at 06:36
  • I updated with the description from the environment window. Thanks! – Garglesoap Jan 17 '18 at 06:49
  • The aws.s3 pkg doc is not very clear about the data types to answer your question. Can you please add the output of `dput(Mybucket[1:3])` to the question (but please anonymize the contents first!), since I need to know the exact data types and attributes to answer your question. But basically it looks like converting everything into a data.frame, converting the LastModifiedDate, sorting it, and taking the last entry... – R Yoda Jan 17 '18 at 11:44
  • They're all JSON files with extensions ranging from .1, .2, .3 ... .x. I'm specifically trying to avoid calling the entire file list of 30,000 files into R and am trying to limit it to, say, .1*, as this would give me .1, .10-19, .100-199, and all the thousands range. – Garglesoap Jan 18 '18 at 01:15
  • Minimizing the network traffic is a good idea; please always mention such non-functional requirements in your question to get precise answers. I think you need a shell script on the server side that filters the most recent file, which is not an R question, but this does give you all the other information that `get_bucket` provides – R Yoda Jan 18 '18 at 06:45

2 Answers


The easiest way ended up being a system command:

currentfile <- system("aws s3 ls s3://bucket/folder --recursive | grep 'file.16' | sort | tail -n 1 | awk '{print $4}'", intern=TRUE)

grep grabs files with 'file.16' in the name, which significantly narrows the search since the current file numbers are in the 1600s. intern=TRUE saves the response, in this case into 'currentfile' as a character string. sort orders the listing by modified date, tail -n 1 takes the last (most recently modified) entry, and awk '{print $4}' prints the fourth column (the file name).
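To then pull that file into R, here is a rough sketch with aws.s3 and jsonlite (the bucket name and the final append step are placeholders, not taken from the question):

library(aws.s3)
library(jsonlite)

# 'currentfile' is the key returned by the system() call above
raw_obj <- get_object(object = currentfile, bucket = "bucket")  # raw vector with the object body
newdata <- fromJSON(rawToChar(raw_obj))                         # parse the JSON payload

# append to the analysis data frame (the structure of 'analysis_df' is assumed)
# analysis_df <- rbind(analysis_df, as.data.frame(newdata))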

For reference: Downloading the latest file in an S3 bucket using AWS CLI?

Garglesoap

I think your question is independent of AWS S3 and I would classify it as "how can I create a data.frame from a list of lists", and there are existing answers for that, e.g.:

R list of lists to data.frame

My solution uses the handy rbindlist from the data.table package.

I had to guess about the data types of Mybucket but a solution could look like this:

# https://cran.r-project.org/web/packages/aws.s3/aws.s3.pdf
# get_bucket: returns a list of objects in the bucket (with class "s3_bucket")
library(data.table)
library(lubridate)

# my personal assumption of the output of "get_bucket" is a list of lists (I have no S3 at hand to verify this)
Mybucket <- list(  list(Key = "folder1/file.1", LastModified = "2018-01-16T09:58:47.000Z", ETag = "\" nnnnnnnnnnn\"")
                 , list(Key = "folder2/file.2", LastModified = "2018-01-16T08:58:47.000Z", ETag = "xyz"))

dt <- rbindlist(Mybucket)  # convert into a data.table (enhanced data.frame)

dt[, LastModAsDate := ymd_hms(LastModified)]  # add a date column

dt.most.recent <- dt[order(-dt$LastModAsDate),][1]  # order by date descending, then pick the top-most row

which results in

> dt.most.recent
              Key             LastModified           ETag       LastModAsDate
1: folder1/file.1 2018-01-16T09:58:47.000Z " nnnnnnnnnnn" 2018-01-16 09:58:47

Please note that the date conversion may lose precision (milliseconds), but the overall solution is sketched anyhow...

To extract the number contained in the file extension use:

tools::file_ext(dt.most.recent$Key)
# [1] "1"
R Yoda
  • Thanks for the suggestion, but to clarify, I am specifically trying to avoid bringing the entire list of 30k files into R. I'm thinking a CLI call through RCurl would be a better option, but there are no --include/--exclude options on the aws s3 ls s3://my-bucket/ command. – Garglesoap Jan 18 '18 at 01:19
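For completeness, get_bucket() also takes prefix and max arguments that limit the listing on the server side (assuming a reasonably current aws.s3 version); a prefix such as folder1/file.1 still matches .10, .100 and the thousands range, as noted above, but it avoids pulling all 30,000 keys. A rough sketch:

library(aws.s3)
library(data.table)

# prefix and max restrict the S3 listing itself; the prefix value here is hypothetical
b <- get_bucket("mybucket1", prefix = "folder1/file.1", max = 1000)

# keep only the fields needed to find the newest object
dt <- rbindlist(lapply(b, function(x) list(Key = x$Key, LastModified = x$LastModified)))

setorder(dt, -LastModified)  # ISO-8601 timestamps sort chronologically as plain strings
dt[1]                        # most recently modified key within the narrowed listing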