Read a 20GB file in chunks without exceeding my RAM - R

Question

I'm currently trying to read a 20GB file. I only need 3 columns of that file. My problem is, that I'm limited to 16 GB of ram. I tried using readr and processing the data in chunks with the function read_csv_chunked and read_csv with the skip parameter, but those both exceeded my RAM limits. Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.

My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?

Welcome to stackoverflow. I'd suggest importing the column you need only by trying `data <- fread(csvfile, header=TRUE, select=, verbose=TRUE)`. where colnumber is the index of the column you want to keep. I doubt it will be enough though. So you might have to import your data in chunks. Do what you need to do and write it back to the disk. — DJJ, Mar 30 '20 at 12:18
My approach is posted here: https://stackoverflow.com/a/68693819/8079808 — San, Aug 07 '21 at 15:59

G. Grothendieck · Accepted Answer · 2020-03-30T18:39:40.180

The LaF package can read in ASCII data in chunks. It can be used directly or if you are using dplyr the chunked package uses it providing an interface for use with dplyr.

The readr package has readr_csv_chunked and related functions.

The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.

It may be that if you remove all but the first three columns that it will be small enough to just read it in and process in one go.

vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in the select= argument which may make it small enough to read it in in one go.

fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.

read.csv.sql in the sqldf (also see github page) package can read a file larger than R can handle into a temporary external SQLite database which it creates for you and removes afterwards and reads the result of the SQL statement given into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments which will depend on your file.

library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file", 
  dbname = tempfile(), ...)

read.table and read.csv in the base of R have a colClasses=argument which takes a vector of column classes. If the file has nc columns then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to only read the first 3 columns.

Another approach is to pre-process the file using cut, sed or awk (available natively in UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix outside of R to remove all but the first three columns and then see if that makes it small enough to read in one go.

Also check out the High Performance Computing task view.

score 3 · Answer 2 · answered Mar 30 '20 at 11:34

We can try something like this, first a small example csv:

X = data.frame(id=1:1e5,matrix(runi(1e6),ncol=10))
write.csv(X,"test.csv",quote=F,row.names=FALSE)

You can use the nrow function, instead of providing a file, you provide a connection, and you store the results inside a list, for example:

data = vector("list",200)

con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])

  id         X1        X2         X3
1  1 0.13870273 0.4480100 0.41655108
2  2 0.82249489 0.1227274 0.27173937
3  3 0.78684815 0.9125520 0.08783347
4  4 0.23481987 0.7643155 0.59345660
5  5 0.55759721 0.6009626 0.08112619
6  6 0.04274501 0.7234665 0.60290296

In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:

for(i in 2:200){
data[[i]] = read.csv(con, nrows=1000,col.names=COLS,header=FALSE)[,1:3]
}

Finally, we build of all of those into a data.frame:

data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE

You can see that I specified a much larger list than required, this is to show if you don't know how long the file is, as you specify something larger, it should work. This is a bit better than writing a while loop..

So we wrap it into a function, specifying the file, number of rows to read at one go, the number of times, and the column names (or position) to subset:

read_chunkcsv=function(file,rows_to_read,ntimes,col_subset){

    data = vector("list",rows_to_read)
    con = file(file,"r")
    data[[1]] = read.csv(con, nrows=rows_to_read)
    COLS = colnames(data[[1]])
    data[[1]] = data[[1]][,col_subset]

    for(i in 2:ntimes){
    data[[i]] = read.csv(con,
    nrows=rows_to_read,col.names=COLS,header=FALSE)[,col_subset]
    }

    return(do.call(rbind,data))
    }

all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))

I can only accept one anwer, but this solution worked aswell. I think my problem was, not to open the file in a connection with `file(path, "r")` — David, Mar 30 '20 at 12:45
yes — you must open the file read-only. If you open it read/write, every time you finish reading the cursor goes back to the start, even if you didn't read the whole file. — JDL, Mar 30 '20 at 12:53
Just ran an experiment and found that `con = gzfile("test.csv.gz","r")` did the job when starting with a gzipped csv file. — San, Aug 08 '21 at 14:29
It has been said above that _"You can see that I specified a much larger list than required, this is to show if you don't know how long the file is, as you specify something larger, it should work. This is a bit better than writing a while loop.."_. I'm curious as to why is it a bit better than writing a `while` loop. How to read through a `while` loop till the end of csv file? — San, Aug 08 '21 at 14:36
I guess `readr::read_csv_chunked` was designed to deal with such scenario. See https://readr.tidyverse.org/reference/read_delim_chunked.html — San, Aug 08 '21 at 14:54

Read a 20GB file in chunks without exceeding my RAM - R

2 Answers2

Linked