I need to load a 3 GB CSV file with about 18 million rows and 7 columns from S3 into R (RStudio). My code for reading data from S3 usually looks like this:
library("aws.s3")
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
Now that the file is much bigger than usual, I get the following error:
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
Reading this post, I understand that the vector is too long, but how would I subset the data in this case? Are there any other suggestions for how to deal with reading larger files from S3?
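The only way I can think of to subset the raw vector myself would be to write it out to a local file in smaller slices and then read that file back in. Something along these lines, although this is only an untested sketch (the chunk size is arbitrary, and data.table::fread() is just one option for reading a CSV of this size):

library(data.table)
tmp <- tempfile(fileext = ".csv")
out <- file(tmp, "wb")
chunk <- 1e8  # bytes per slice; arbitrary, just small enough for one writeBin() call
for (i in seq(1, length(obj), by = chunk)) {
  writeBin(obj[i:min(i + chunk - 1, length(obj))], out)
}
close(out)
data <- fread(tmp)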
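Another idea I have been wondering about is whether I can avoid rawToChar() entirely: if I read the aws.s3 docs correctly, save_object() downloads the object straight to a local file, which I could then read from disk (again untested, and using fread() here is just an assumption on my part):

library(aws.s3)
library(data.table)
tmp <- tempfile(fileext = ".csv")
# download directly to a file instead of pulling the raw bytes into memory
save_object("s3://myBucketName/aFolder/fileName.csv", file = tmp)
data <- fread(tmp)

Would either of these be a reasonable approach?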