
I am trying to read the headers from a CSV file stored on S3. I have tried several ways to do this, but all of my approaches download the CSV from S3 into R locally and then read the header. That is not an efficient way to do it.

My attempts:

dt <- aws.s3::s3read_using(FUN = function(f) fread(f, header = TRUE, nrows = 1),
                           bucket = "bucket_name/path",
                           object = "abc.csv")
cols <- colnames(dt)

Second Attempt:

  # Getting file locally and then reading headers.
  system(paste("s3cmd get --force -v ", s3Path, s3FileName, " ", s3FileName, sep = ""))
  df <- data.table::fread(s3FileName, ...)
  cols <- colnames(df)

I know there must be a more efficient way to do this. Any suggestions would be much appreciated. I am specifically looking to do this in R.

Rushabh Patel

1 Answer


Short answer: S3 is an object store, not a file system. You cannot, in general, perform file-system operations on remote S3 objects.

Longer, more accurate answer: you do not have to download the entire file each time. The S3 API supports ranged GET requests (via the HTTP Range header), so you can pull down just a section of the object.

Pull down the first n KB of each file, where n is large enough to always include the header row, then process the header as normal.
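A minimal sketch of that idea using aws.s3::get_object, which accepts extra HTTP request headers and so can send a Range header. The bucket name, object key, and the 8 KB range below are placeholders; adjust them to your data:

    # Request only the first 8 KB of the object -- usually enough to cover the
    # header row -- instead of downloading the whole CSV.
    library(aws.s3)
    library(data.table)

    raw_bytes <- aws.s3::get_object(
      object  = "abc.csv",          # placeholder object key
      bucket  = "bucket_name",      # placeholder bucket
      headers = list(Range = "bytes=0-8191")
    )

    # Decode the bytes, keep only the first line (stripping a trailing \r if the
    # file has Windows line endings), and let fread parse it into column names.
    txt        <- rawToChar(raw_bytes)
    header_row <- sub("\r$", "", strsplit(txt, "\n", fixed = TRUE)[[1]][1])
    cols       <- colnames(data.table::fread(text = header_row, header = TRUE))
    cols

If your header row can be longer than 8 KB (very wide tables), simply increase the range; either way only the requested bytes are transferred, not the whole object.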

mcfinnigan