
I want to read CSV files in R that are stored in an S3 directory. Each file is more than 6 GB in size, and every file is needed for further calculation in R. Imagine that I have 10 files in an S3 folder and I need to read each of them separately before a for loop. First, I tried this, and it works in the case where I know the name of the CSV file:

library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = "xyy",
           "AWS_SECRET_ACCESS_KEY" = "yyx")

data <-
  s3read_using(FUN = read.csv, object = "my_folder/file.csv",
               sep = ",", stringsAsFactors = F, header = T)

However, how can I access multiple files without explicitly giving their names to the s3read_using function? This is necessary because I use partition() in Spark, which divides the original dataset into parts with generic names (e.g. part1-0839709037fnfih.csv). If I could automatically list the CSV files in an S3 folder and use them before my calculation, that would be great.

get_ls_files <- .... # gives me a list of all csv files in the S3 folder

for (i in 1:length(get_ls_files)){

    filename = get_ls_files[i]

    # paste0() avoids the space that paste() would insert into the object key
    tmp = s3read_using(FUN = read.csv, object = paste0("my_folder/", filename),
                       sep = ",", stringsAsFactors = F, header = T)

    .....
}
  • This should be helpful: https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once – kath Sep 30 '19 at 11:27
  • @kath Thanks for the response! This works on a local machine, but not in S3. I tried `list.files(path='s3://my_folder', pattern="*.csv")` and it returns `character(0)`. – Makaroni Sep 30 '19 at 11:33
  • If you have 6 GB files you might want to try out `vroom` (https://github.com/r-lib/vroom) instead of base `read.csv`; you may get considerable speed improvements (see the sketch after these comments). – Richard J. Acton Sep 30 '19 at 16:52
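
A rough sketch of that `vroom` suggestion (not from the original post): `s3read_using()` downloads the object to a temporary file and passes that path as the first argument to `FUN`, and `vroom::vroom()` accepts a file path, so it can stand in for `read.csv`. The bucket and object names below are placeholders.

library(aws.s3)
library(vroom)

# vroom reads large delimited files lazily and is typically much faster than read.csv;
# s3read_using() hands it the path of the downloaded temporary file
data <- s3read_using(FUN = vroom::vroom,
                     delim = ",",                    # forwarded to vroom()
                     object = "my_folder/file.csv",  # placeholder object key
                     bucket = "my_bucket")           # placeholder bucket name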

1 Answer


I found an answer, in case anyone needs it, although the documentation is not good. To get a list of files in a particular S3 folder you need to use get_bucket and define a prefix. After that, search the listing for the .csv extension to get all the .csv files in that folder.

# list everything under the prefix, then keep only the keys ending in ".csv"
tmp = get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv = data.frame(tmp)
csv_paths = list_csv$Key[grep("\\.csv$", list_csv$Key)]
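
For the follow-up question in the comments (how to iterate over this list), here is a minimal sketch, not part of the original answer, of feeding `csv_paths` into the loop from the question; the bucket name is a placeholder. Each element of `csv_paths` is already a full object key returned by `get_bucket`, so it can be passed straight to `s3read_using`:

library(aws.s3)

all_data <- vector("list", length(csv_paths))
for (i in seq_along(csv_paths)) {
    # read one CSV per iteration; the key already includes the "folder/subfolder/" prefix
    all_data[[i]] <- s3read_using(FUN = read.csv,
                                  sep = ",", stringsAsFactors = FALSE, header = TRUE,
                                  object = csv_paths[i],
                                  bucket = "my_bucket")
    # ... per-file calculation from the question goes here ...
}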
  • Guess I'm just bad at for loops, but how do I then iterate over this list to read all the files in using `s3read_using` or `get_object`? – Ben G Jun 08 '21 at 18:34
  • @BenG This was a long time ago, but I think the `for` loop is given in the question itself. The answer here shows how to get a list of all `.csv` files in an `S3` folder, and you can use that list before the `for` loop in the question (i.e. to replace `get_ls_files`). – Makaroni Sep 13 '21 at 10:43