I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.

Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?

Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks: Quickly reading very large tables as dataframes in R

I can think of some other workarounds (e.g. open in a good text editor, lop off 2/3 of the observations, then load in R), but I'd rather avoid them if possible.

So reading it in pieces still seems like the best way to go for now.

Ari B. Friedman
  • This has been discussed in detail here, in particular JD Long's answer is quite useful: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – Brandon Bertelsen Feb 19 '12 at 20:54
  • Sorry, that does answer the first question. Apparently my SO search-fu needs honing, as I did search but couldn't find it. It leaves the second one unanswered, though: how to read in a .CSV file in pieces. – Ari B. Friedman Feb 19 '12 at 23:16
  • @mdsumner Interesting. Looks like there's a read.csv.ffdf() I could use. Care to make this an answer in the linked question so I can upvote it? :-) – Ari B. Friedman Feb 19 '12 at 23:24
  • The short-term fix might be asking your cluster admin how you can reserve more than 4GB on a cluster node! On our cluster, it's just a case of sticking ##BSUB -m 8G (or something) in your batch submission file and then waiting slightly longer for two slots to be free on the same node. – Spacedman Feb 20 '12 at 08:06
  • @Spacedman I did as you recommended. Turns out, they have a sub-cluster within the cluster that is comprised of 4 machines with 40GB of RAM each :-) – Ari B. Friedman Feb 20 '12 at 22:07
  • Well if you need more, our cluster has 40 nodes each with 96GB RAM. I think our cluster admin guy might be compensating for something. – Spacedman Feb 21 '12 at 09:28
  • My approach to large more-than-RAM files: https://stackoverflow.com/a/68693819/8079808 – San Aug 07 '21 at 16:08

2 Answers

After reviewing this thread I noticed that an obvious solution to this problem was not mentioned. Use connections!

1) Open a connection to your file

con = file("file.csv", "r")

2) Read in chunks of the file with read.csv

read.csv(con, nrows = 1e5, ...)  # nrows = whatever chunk size fits in memory

Side note: defining colClasses will greatly speed things up. Make sure to set unwanted columns to the character string "NULL" so they are skipped while reading.

3) Do whatever you need to do

4) Repeat.

5) Close the connection

close(con)

The advantage of this approach is the connection itself. If you skip this step, things will likely slow down a bit. By opening a connection manually, you essentially open the data set and do not close it until you call the close function. This means that as you loop through the data set you never lose your place. Imagine that you have a data set with 1e7 rows and that you want to load a chunk of 1e5 rows at a time. Since the connection stays open, we get the first 1e5 rows by running read.csv(con, nrows = 1e5, ...), then get the second chunk by running read.csv(con, nrows = 1e5, ...) again, and so on....

If we did not use a connection we would get the first chunk the same way, read.csv("file.csv", nrows = 1e5, ...), but for the next chunk we would need read.csv("file.csv", skip = 1e5, nrows = 1e5, ...). Clearly this is inefficient: we have to scan past the first 1e5 rows all over again, despite having just read them.
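Putting the steps together, a rough sketch of the loop might look like this (the file name, chunk size, and colClasses vector are placeholders for your data; the end of the file is detected by catching the error read.csv throws once the connection is exhausted):

chunk_size  <- 1e5                                    # rows per chunk
col_classes <- c("integer", "character", "NULL")      # placeholder: adjust to your columns; "NULL" skips one

con <- file("file.csv", "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header once (assumes an unquoted header row)

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE,
             col.names = col_names, colClasses = col_classes),
    error = function(e) NULL)                         # read.csv errors when no lines remain
  if (is.null(chunk)) break

  # ... subset the chunk down to the rows/columns you need and store the result ...

  if (nrow(chunk) < chunk_size) break                 # a short chunk means we just read the last piece
}

close(con)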

Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with it.

I hope this helps someone.

UPDATE

People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
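As an aside (not something the original answer spells out), readr also provides read_csv_chunked(), which runs the chunk loop for you via a callback. A minimal sketch, where the file name, chunk size, and the State filter are made-up placeholders:

library(readr)

keep_rows <- function(chunk, pos) subset(chunk, State == "CA")   # keep only the rows you need from each chunk

df <- read_csv_chunked("file.csv",
                       callback   = DataFrameCallback$new(keep_rows),
                       chunk_size = 1e5)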

Jacob H
  • data.table's fread has planned support for connections in the next stable release, more details in [data.table#561](https://github.com/Rdatatable/data.table/issues/561) – jangorecki Dec 02 '15 at 21:15
  • Note that the `read*` functions of the `iotools` package can be fed a connection. – lmo Aug 03 '17 at 16:02
  • Large csv files are usually kept gzipped, in which case the connection should be specified as `con = gzfile("file.csv.gz","r")` – San Aug 08 '21 at 14:58
  • This looks promising but I feel I am missing a step: How do I loop through the file? As in, how do I determine I have reached the end of the file? – LuleSa Aug 31 '22 at 13:00

You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.

If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not pass through R, so R's limitations (primarily RAM in this scenario) don't apply. Second, after loading the data into the database, sqldf reads the output of the specified sql statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.

Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.

DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)

See ?read.csv.sql and ?sqldf and also the sqldf home page.
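For example, to keep only one state's rows from the whole file, something like the following should work (the State column and its value are made-up placeholders; if I remember the sqldf convention correctly, read.csv.sql exposes the csv to the sql statement as a table named file):

library(sqldf)

DF <- read.csv.sql("myfile.csv",
                   sql = "select * from file where State = 'CA'")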

G. Grothendieck
  • Very cool. Still seems a bit inefficient though to read in the whole file and dump most of it. It does suggest that I could just subset it down to the state which I want in SQL though, which likely solves my problem. – Ari B. Friedman Feb 19 '12 at 23:25
  • If you only need to subset it down to a specific set of rows then you can just use `read.table(..., skip = ..., nrows = ...)` – G. Grothendieck Feb 20 '12 at 00:21
  • I'd forgotten about that. Wow, really having a question fail day. But I learned two new things from this (`ff` package and `sqldf` both have a filter option), so perhaps worth it. – Ari B. Friedman Feb 20 '12 at 00:42
  • If it does not close the connection itself then try `closeAllConnections()`. – G. Grothendieck Nov 02 '18 at 16:34
  • In my experience, `DuckDB` performs better than `RSQLite` or `read.csv.sql` which also runs `SQLite` under the hood. – San Aug 09 '21 at 10:14