
I want to read CSV files in R that are stored in an S3 directory. Each file is more than 6 GB in size, and every file is needed for further calculation in R. Imagine that I have 10 files in an S3 folder and I need to read each of them separately before a for loop. First, I tried this, and it works in the case where I know the name of the CSV file:

library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = "xyy",
           "AWS_SECRET_ACCESS_KEY" = "yyx")

data <-
  s3read_using(FUN = read.csv, object = "my_folder/file.csv",
               sep = ",", stringsAsFactors = F, header = T)

However, how can I access multiple files without explicitly giving their names to the s3read_using function? This is necessary because I use partition() in Spark, which divides the original dataset into parts with generic names (e.g. part1-0839709037fnfih.csv). If I could automatically list the CSV files in an S3 folder and use them before my calculation, that would be great.

get_ls_files <- .... # gives me a list of all csv files in the S3 folder

for (i in 1:length(get_ls_files)){

    filename = get_ls_files[i]

    # paste0() avoids the space that paste() would insert into the object key
    tmp = s3read_using(FUN = read.csv, object = paste0("my_folder/", filename),
                       sep = ",", stringsAsFactors = F, header = T)

    .....
}
  • This should be helpful: https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once – kath Sep 30 '19 at 11:27
  • @kath Thanks for the response! This works on a local machine, but not in S3. I tried `list.files(path='s3://my_folder', pattern="*.csv")` and it returns `character(0)`. – Makaroni Sep 30 '19 at 11:33
  • If you have 6 GB files you might want to try out `vroom` (https://github.com/r-lib/vroom) instead of base `read.csv`; you may get considerable speed improvements (see the sketch after these comments). – Richard J. Acton Sep 30 '19 at 16:52
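
A rough sketch of that `vroom` suggestion (not from the original post): `s3read_using()` downloads the object to a temporary file and passes that path as the first argument to `FUN`, and `vroom::vroom()` accepts a file path, so it can stand in for `read.csv`. The bucket and object names below are placeholders.

library(aws.s3)
library(vroom)

# vroom reads large delimited files lazily and is typically much faster than read.csv;
# s3read_using() hands it the path of the downloaded temporary file
data <- s3read_using(FUN = vroom::vroom,
                     delim = ",",                    # forwarded to vroom()
                     object = "my_folder/file.csv",  # placeholder object key
                     bucket = "my_bucket")           # placeholder bucket name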

1 Answer


I found an answer, in case anyone needs it, although the documentation is not good. To get a list of files in a particular S3 folder you need to use get_bucket and define a prefix. After that, search the listing for the .csv extension to get all the .csv files in that folder.

# list everything under the prefix, then keep only the keys ending in ".csv"
tmp = get_bucket(bucket = "my_bucket", prefix = "folder/subfolder")
list_csv = data.frame(tmp)
csv_paths = list_csv$Key[grep("\\.csv$", list_csv$Key)]
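
For the follow-up question in the comments (how to iterate over this list), here is a minimal sketch, not part of the original answer, of feeding `csv_paths` into the loop from the question; the bucket name is a placeholder. Each element of `csv_paths` is already a full object key returned by `get_bucket`, so it can be passed straight to `s3read_using`:

library(aws.s3)

all_data <- vector("list", length(csv_paths))
for (i in seq_along(csv_paths)) {
    # read one CSV per iteration; the key already includes the "folder/subfolder/" prefix
    all_data[[i]] <- s3read_using(FUN = read.csv,
                                  sep = ",", stringsAsFactors = FALSE, header = TRUE,
                                  object = csv_paths[i],
                                  bucket = "my_bucket")
    # ... per-file calculation from the question goes here ...
}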
  • Guess I'm just bad at for loops, but how do I then iterate over this list to read all the files in using `s3read_using` or `get_object`? – Ben G Jun 08 '21 at 18:34
  • @BenG This was a long time ago, but I think the `for` loop is given in the question itself. The answer here shows how to get a list of all `.csv` files in an `S3` folder, and you can use that list before the `for` loop in the question (i.e. to replace `get_ls_files`). – Makaroni Sep 13 '21 at 10:43