
I've got an S3 bucket being updated in real time with API data. The files are saved with a .XXX extension, where XXX is 1...n.

My R script needs to be able to grab the latest files and add them to the analysis data frame. I've been using the aws.s3 package so far. After setting the secret/access keys in the environment:

mybucket <- get_bucket("mybucket1")

This returns an s3 object of 1000 elements (presumably more), and it looks like each object has a Contents: List of 7, one of which is $LastModified. How do I get the name of the last modified file?

Mybucket     Large s3_bucket (1000 elements, 2.1Mb)
contents: List of 7
 ..$ Key         : chr "folder1"
 ..$ LastModified: chr "2018-01-16T09:58:47.000Z"
 ..$ ETag        : chr "\" nnnnnnnnnnn\""
 etc. ($Owner, $StorageClass, $Bucket, - attr)
contents: List of 7
 ..$ Key         : chr "folder1/file.1"
 ..$ LastModified: chr "2018....etc"
 ..$ ETag        : chr "...etc..."
 etc.
contents: List of 7
etc.

It's really the number after 'file.' that I need (in this case it would be 1).

After experimentation, I think a CLI command through RCurl would be a better option.

aws s3 ls s3://mybucket --recursive | grep APIdata@symbol=XXX&interval=5.1*

This gets me really close, but the command is dropping the '&interval=5.1*' part, so it returns ALL objects matching 'APIdata@symbol=XXX*'.
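The unquoted & is a shell control operator, so everything after it never reaches grep; quoting the pattern keeps it intact. A minimal, untested sketch through R's system(), reusing the bucket and pattern from above:

# quoting the grep pattern stops the shell from treating '&' as a control operator
files <- system(
  "aws s3 ls s3://mybucket --recursive | grep 'APIdata@symbol=XXX&interval=5.1'",
  intern = TRUE
)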

Garglesoap
  • Please add some example data to your question (e.g. the first n elements) to make it easier for us to answer. Thx :-) – R Yoda Jan 17 '18 at 06:36
  • I updated with the description from the environment window. Thanks! – Garglesoap Jan 17 '18 at 06:49
  • The aws.s3 pkg doc is not very clear about the data types to answer your question. Can you please add the output of `dput(Mybucket[1:3])` to the question (but please anonymize the contents first!), since I need to know the exact data types and attributes to answer your question. But basically it looks like converting everything into a data.frame, converting the LastModifiedDate, sorting it, and taking the last entry... – R Yoda Jan 17 '18 at 11:44
  • They're all JSON files with extensions ranging from .1, .2, .3 ... .x. I'm specifically trying to avoid calling the entire file list of 30,000 files into R and am trying to limit it to, say, .1*, as this would give me .1, .10-19, .100-199, and all the thousands range. – Garglesoap Jan 18 '18 at 01:15
  • Minimizing the network traffic is a good idea; please always mention such non-functional requirements in your question to get precise answers. I think you need a shell script on the server side that filters the most recent file, which is not an R question, but this does give you all the other information that `get_bucket` provides – R Yoda Jan 18 '18 at 06:45

2 Answers


The easiest way ended up being a system command:

currentfile <- system("aws s3 ls s3://bucket/folder --recursive | grep 'file.16' | sort | tail -n 1 | awk '{print $4}'", intern=TRUE)

grep grabs files with 'file.16' in the name, which significantly narrows the search since the current file numbers are in the 1600s. intern=TRUE saves the response, in this case into 'currentfile' as a character string. sort orders the listing by modified date, tail -n 1 takes the last (most recently modified) entry, and awk '{print $4}' prints the fourth column (the file name).
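To then pull that file into R, here is a rough sketch with aws.s3 and jsonlite (the bucket name and the final append step are placeholders, not taken from the question):

library(aws.s3)
library(jsonlite)

# 'currentfile' is the key returned by the system() call above
raw_obj <- get_object(object = currentfile, bucket = "bucket")  # raw vector with the object body
newdata <- fromJSON(rawToChar(raw_obj))                         # parse the JSON payload

# append to the analysis data frame (the structure of 'analysis_df' is assumed)
# analysis_df <- rbind(analysis_df, as.data.frame(newdata))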

For reference: Downloading the latest file in an S3 bucket using AWS CLI?

Garglesoap

I think your question is independent of AWS S3 and I would classify it as "how can I create a data.frame from a list of lists", and there are existing answers for that, e.g.:

R list of lists to data.frame

My solution uses the handy rbindlist from the data.table package.

I had to guess about the data types of Mybucket but a solution could look like this:

# https://cran.r-project.org/web/packages/aws.s3/aws.s3.pdf
# get_bucket: returns a list of objects in the bucket (with class "s3_bucket")
library(data.table)
library(lubridate)

# my personal assumption of the output of "get_bucket" is a list of lists (I have no S3 at hand to verify this)
Mybucket <- list(  list(Key = "folder1/file.1", LastModified = "2018-01-16T09:58:47.000Z", ETag = "\" nnnnnnnnnnn\"")
                 , list(Key = "folder2/file.2", LastModified = "2018-01-16T08:58:47.000Z", ETag = "xyz"))

dt <- rbindlist(Mybucket)  # convert into a data.table (enhanced data.frame)

dt[, LastModAsDate := ymd_hms(LastModified)]  # add a date column

dt.most.recent <- dt[order(-dt$LastModAsDate),][1]  # order by date descending, then pick the top-most row

which results in

> dt.most.recent
              Key             LastModified           ETag       LastModAsDate
1: folder1/file.1 2018-01-16T09:58:47.000Z " nnnnnnnnnnn" 2018-01-16 09:58:47

Please note that the date conversion may lose precision (milliseconds), but the overall solution is sketched anyhow...

To extract the number contained in the file extension use:

tools::file_ext(dt.most.recent$Key)
# [1] "1"
R Yoda
  • Thanks for the suggestion, but to clarify, I am specifically trying to avoid bringing the entire list of 30k files into R. I'm thinking a CLI call through RCurl would be a better option, but there are no --include/--exclude options on the aws s3 ls s3://my-bucket/ command. – Garglesoap Jan 18 '18 at 01:19
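For completeness, get_bucket() also takes prefix and max arguments that limit the listing on the server side (assuming a reasonably current aws.s3 version); a prefix such as folder1/file.1 still matches .10, .100 and the thousands range, as noted above, but it avoids pulling all 30,000 keys. A rough sketch:

library(aws.s3)
library(data.table)

# prefix and max restrict the S3 listing itself; the prefix value here is hypothetical
b <- get_bucket("mybucket1", prefix = "folder1/file.1", max = 1000)

# keep only the fields needed to find the newest object
dt <- rbindlist(lapply(b, function(x) list(Key = x$Key, LastModified = x$LastModified)))

setorder(dt, -LastModified)  # ISO-8601 timestamps sort chronologically as plain strings
dt[1]                        # most recently modified key within the narrowed listing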