Reading multiple csv files faster into data.table R

Question

I have 900000 csv files which I want to combine into one big data.table. For this I created a for loop which reads every file one by one and adds it to the data.table. The problem is that it performs too slowly and the time used grows exponentially. It would be great if someone could help me make the code run faster. Each of the csv files has 300 rows and 15 columns. The code I am using so far:

library(data.table)
setwd("~/My/Folder")

WD="~/My/Folder"
data<-data.table(read.csv(text="X,Field1,PostId,ThreadId,UserId,Timestamp,Upvotes,Downvotes,Flagged,Approved,Deleted,Replies,ReplyTo,Content,Sentiment"))

csv.list<- list.files(WD)
k=1

for (i in csv.list){
  temp.data<-read.csv(i)
  data<-data.table(rbind(data,temp.data))

  if (k %% 100 == 0)
    print(k/length(csv.list))

  k<-k+1
}

R may not be the right tool; see Spacedman's answer here, for example http://stackoverflow.com/a/11433740/210673 — Aaron left Stack Overflow, Jul 09 '15 at 12:00
It may be blasphemy in an [tag:r] question, but `csvstack` can make quick work of the combining: http://csvkit.readthedocs.org/en/0.9.1/scripts/csvstack.html ( `pip install csvkit` ). You'll definitely want to use `data.table::fread` on that resultant, GIANT CSV file, though. — hrbrmstr, Jul 09 '15 at 12:50
Two points: even with an approximate size of just 4 bytes for every single entry, the final size in memory will be 4 Bytes * 15 Columns * 300 Rows * 900000 Files / 1024^3 >= 15 GB. Using `rbind()` and other memory-intensive copying techniques will double that amount. — Christian Borck, Jul 09 '15 at 12:54
Maybe you could merge first all csv files like `cat *.csv > merged.csv` and then import just the resulting merged.csv file. — Daniel Fischer, Jul 09 '15 at 13:03
First, why would you use `data.table` and not use `fread`? Next, don't reassign with the `<-` operator. That copies your table to a new instance on each loop cycle. — Serban Tanasa, Jul 09 '15 at 13:29
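
The memory estimate in the comments above can be checked with a few lines of arithmetic. This is only a sketch: the 4 bytes per entry is the commenter's lower-bound assumption, and the variable names are made up for illustration.

```r
# Figures from the question (files, rows, columns) and the comment (4 bytes per entry).
bytes_per_entry <- 4
n_cols  <- 15
n_rows  <- 300
n_files <- 900000

total_entries <- n_cols * n_rows * n_files            # number of cells in the combined table
size_gib      <- bytes_per_entry * total_entries / 1024^3

round(size_gib, 1)  # about 15.1
```

R stores doubles in 8 bytes, and text columns usually need more, so the real footprint is likely higher than this lower bound.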

Nick Kennedy · Answer 1 · 2015-07-09T16:44:14.743

14

Presuming your files are conventional csv, I'd use data.table::fread since it's faster. If you're on a Linux-like OS, I would use the fact it allows shell commands. Presuming your input files are the only csv files in the folder I'd do:

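# tail -n +2 prints each file from its second line on (dropping the header); -q suppresses the file-name banners
dt <- fread(cmd = "tail -n +2 -q ~/My/Folder/*.csv", header = FALSE)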

You'll need to set the column names manually afterwards.
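
A minimal sketch of that renaming step, assuming the column names from the header string in the question. The small table built here is only a stand-in for the header-less result of the shell-based read.

```r
library(data.table)

# Column names taken from the header string in the question.
col_names <- c("X", "Field1", "PostId", "ThreadId", "UserId", "Timestamp",
               "Upvotes", "Downvotes", "Flagged", "Approved", "Deleted",
               "Replies", "ReplyTo", "Content", "Sentiment")

# Stand-in for a table read without a header: columns are named V1, V2, ...
dt <- as.data.table(matrix(0L, nrow = 2, ncol = length(col_names)))

# setnames() renames in place, without copying the table.
setnames(dt, col_names)
```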

If you wanted to keep things in R, I'd use lapply and rbindlist:

lst <- lapply(csv.list, fread)
dt <- rbindlist(lst)

You could also use plyr::ldply:

dt <- setDT(ldply(csv.list, fread))

This has the advantage that you can use .progress = "text" to get a readout of progress in reading.

All of the above assume that the files all have the same format and have a header row.
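
To try the lapply and rbindlist approach without the real data, here is a self-contained sketch. The temporary folder, the file names and the two columns are invented for the demo, and the `idcol` argument is an optional extra that records which file each row came from.

```r
library(data.table)

# Write three small demo files into a fresh temporary folder (hypothetical data).
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
for (k in 1:3) {
  fwrite(data.table(PostId = (1:2) + 10L * k, Upvotes = c(k, k + 1L)),
         file.path(tmp_dir, sprintf("part_%02d.csv", k)))
}

# full.names = TRUE returns paths that fread can open from any working directory.
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
names(csv_files) <- basename(csv_files)

# Read every file into a list, then bind once; idcol stores the list names.
dt <- rbindlist(lapply(csv_files, fread), idcol = "source_file")
```

Binding once at the end avoids copying the growing table on every iteration, which is what makes the original loop slow.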

edited Jul 09 '15 at 16:44

answered Jul 09 '15 at 13:09

Nick Kennedy


`rbindlist(lst)` should outperform `do.call("rbind", lst)` – GSee Jul 09 '15 at 14:23
@Frank oops. Thanks for picking that up. – Nick Kennedy Jul 09 '15 at 16:44
IIRC `rbindlist` is only in `1.9.5` (currently, development version) – MichaelChirico Jul 09 '15 at 17:27
@MichaelChirico `rbindlist` has new features in the dev version but has been around for many versions. – Dean MacGregor Jul 09 '15 at 17:34
It is not working for my files; after 2 days still nothing has happened – Carlo Jul 15 '15 at 13:25

score 4 · Answer 2 · edited May 23 '17 at 12:01

Building on Nick Kennedy's answer using plyr::ldply there is roughly a 50% speed increase by enabling the .parallel option while reading 400 csv files roughly 30-40 MB each.

Original answer with progress bar

dt <- setDT(ldply(csv.list, fread, .progress = "text"))

Enabling .parallel also with a text progress bar

library(plyr)
library(data.table)
library(doSNOW)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))

stopCluster(cl)
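
If you would rather not depend on plyr and doSNOW, a similar parallel read can be sketched with the base `parallel` package. This is a swapped-in alternative rather than the method above, it has no progress bar, and the demo files and the worker count of 2 are assumptions for illustration.

```r
library(data.table)
library(parallel)

# Write four small demo files into a fresh temporary folder (hypothetical data).
tmp_dir <- tempfile("csv_par_")
dir.create(tmp_dir)
for (k in 1:4) {
  fwrite(data.table(id = k, value = 1:3), file.path(tmp_dir, paste0("f", k, ".csv")))
}
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Start two worker processes, read the files in parallel, then shut the workers down.
cl  <- makeCluster(2)
lst <- parLapply(cl, csv_files, data.table::fread)
stopCluster(cl)

# Combine the per-file tables in the main process.
dt <- rbindlist(lst)
```

Whether this is faster than a serial read depends on the disk and on how much time goes into shipping the results back from the workers, so it is worth timing on a subset first.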

score 3 · Answer 3 · edited May 23 '17 at 11:45

As suggested by @Repmat, use rbind.fill. As suggested by @Christian Borck, use fread for faster reads.

require(data.table)
require(plyr)

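# full.names = TRUE returns paths that fread can open regardless of the working directory
files <- list.files("dir/name", pattern = "\\.csv$", full.names = TRUE)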
df <- rbind.fill(lapply(files, fread, header=TRUE))

Alternatively you could use do.call, but rbind.fill is faster (http://www.r-bloggers.com/the-rbinding-race-for-vs-do-call-vs-rbind-fill/)

df <- do.call(rbind, lapply(files, fread, header=TRUE))

Or you could use the data.table package, see this
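
One way to do this with data.table alone is `rbindlist` with `fill = TRUE`, which pads missing columns with NA much like rbind.fill does. The two small tables below are made up for illustration.

```r
library(data.table)

# Two hypothetical tables whose columns only partly overlap.
a <- data.table(PostId = 1:2, Upvotes = c(5L, 3L))
b <- data.table(PostId = 3L, Downvotes = 1L)

# use.names matches columns by name; fill adds NA where a table lacks a column.
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
```

When all files share the same columns, as in the question, plain `rbindlist(lst)` is enough and `fill` can be left out.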


answered Jul 09 '15 at 12:54

Simon Mills


fread and rbind.fill work great! I use this precise combo for scraping huge lists of files off the net, so I can vouch for it! – Serban Tanasa Jul 09 '15 at 13:28
Is `rbind.fill` better than `rbind(..., fill=TRUE)`? – GSee Jul 09 '15 at 14:24
Yes, almost everything you can think of is faster than rbind. But you will not notice the difference, for small files or for a small number of files. – Repmat Jul 09 '15 at 17:26
@Repmat Since it's `data.table`s that are being combined in this case, I doubt `rbind.fill` is faster. The `data.table` package does some magic with `rbind` – GSee Jul 09 '15 at 22:21
I have not tested it, but I think you would need to do rbindlist to get the speed from package data.table. At least the package itself describes rbindlist as: "Same as do.call("rbind",l), but much faster. " - But I could be wrong. – Repmat Jul 10 '15 at 12:58

Repmat · Answer 4 · 2015-07-09T13:02:30.403

2

You are growing your data table in a for loop - this is why it takes forever. If you want to keep the for loop as is, first create an empty data frame (before the loop) that has the dimensions you need (rows x columns), so that the memory is allocated once up front.

Then write to this empty frame in each iteration.

Otherwise use rbind.fill from package plyr - and avoid the loop altogether. To use rbind.fill:

require(plyr)
data <- rbind.fill(df1, df2, df3, ... , dfN)

To pass the names of the df's, you could/should use an apply function.
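
The preallocation idea from the first paragraph can be sketched as follows. Note the substitutions: this uses a data.table and `set()` instead of a plain data frame, `read_one()` is a hypothetical stand-in for reading one file, and the sizes are tiny demo values.

```r
library(data.table)

n_files       <- 3L   # hypothetical; 900000 in the question
rows_per_file <- 4L   # hypothetical; 300 in the question

# Hypothetical stand-in for fread(csv.list[k]); returns one file's worth of rows.
read_one <- function(k) {
  data.table(file_id = rep(k, rows_per_file),
             value   = as.numeric(seq_len(rows_per_file)) * k)
}

# Allocate the full table once, with the final row count and column types.
n_total <- n_files * rows_per_file
out <- data.table(file_id = integer(n_total), value = numeric(n_total))

for (k in seq_len(n_files)) {
  rows  <- seq.int((k - 1L) * rows_per_file + 1L, k * rows_per_file)
  chunk <- read_one(k)
  for (col in names(out)) {
    # set() writes into the existing table in place, so `out` is never copied.
    set(out, i = rows, j = col, value = chunk[[col]])
  }
}
```

This only works when every file has exactly the expected number of rows; if the row counts vary, collecting the tables in a list and binding once is simpler.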

edited Jul 09 '15 at 13:02

answered Jul 09 '15 at 12:07

Repmat


Could you explain how to use rbind.fill properly? – Carlo Jul 09 '15 at 12:22
While initializing the final dataframe would be better if there was a constraint against `data.table`, abandoning data.table isn't going to be the best solution. – Dean MacGregor Jul 09 '15 at 17:36

score 1 · Answer 5 · answered Jul 09 '15 at 12:36

I agree with @Repmat: your current solution using rbind() copies the whole data.table in memory every time it is called, so each iteration gets slower as the table grows (the total time grows roughly quadratically with the number of files, rather than exponentially). Another way would be to first create an empty csv file containing only the headers and then simply append the data from all your files to it.

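# Append each input file's rows (without header) to the combined file
for (i in csv.list) {
  write.table(fread(i), file = "your_final_csv_file", sep = ";",
              col.names = FALSE, row.names = FALSE, append = TRUE, quote = FALSE)
}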

This way you don't have to worry about putting the data to the right indexes in your data.table. Also as a hint: fread() is the data.table file reader which is much faster than read.csv.
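
A self-contained sketch of the append approach, using data.table's `fwrite` in place of write.table (a substitution, not the code above). The demo files, their columns and the output name `combined.csv` are made up for illustration.

```r
library(data.table)

# Write three small demo input files into a fresh temporary folder (hypothetical data).
tmp_dir <- tempfile("csv_app_")
dir.create(tmp_dir)
for (k in 1:3) {
  fwrite(data.table(PostId = k, Upvotes = k * 2L), file.path(tmp_dir, paste0("in", k, ".csv")))
}
inputs   <- list.files(tmp_dir, pattern = "^in.*\\.csv$", full.names = TRUE)
out_file <- file.path(tmp_dir, "combined.csv")

for (f in inputs) {
  # The first write creates the file with a header; later writes append rows only.
  fwrite(fread(f), out_file, append = file.exists(out_file))
}

result <- fread(out_file)
```

Only one input file is held in memory at a time, so this stays cheap even when the combined result would not fit in RAM.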

In general, R wouldn't be my first choice for this kind of data-munging task.

Mike Wise · Answer 6 · 2015-07-09T12:00:44.523

One suggestion would be to merge them first in groups of 10 or so, and then merge those groups, and so on. That has the advantage that if individual merges fail, you don't lose all the work. The way you are doing it now not only leads to ever-slowing execution, but also exposes you to having to start over from the very beginning every time you fail.

This way will also decrease the average size of the data frames involved in the rbind calls, since most of the appends will be to small data frames and only a few, at the end, to large ones. This should eliminate most of the execution time that currently grows with the size of the accumulated table.
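
The grouping idea can be sketched like this. It assumes data.table's fread and rbindlist for the reading and binding, and the 25 demo files and the batch size of 10 are illustrative values only.

```r
library(data.table)

# Write 25 small demo files into a fresh temporary folder (hypothetical data).
tmp_dir <- tempfile("csv_batch_")
dir.create(tmp_dir)
for (k in 1:25) {
  fwrite(data.table(file_id = k, row = 1:2), file.path(tmp_dir, sprintf("f%02d.csv", k)))
}
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Split the file list into groups of at most batch_size files.
batch_size <- 10
batches <- split(csv_files, ceiling(seq_along(csv_files) / batch_size))

# Merge within each group first, then merge the group results.
batch_tables <- lapply(batches, function(files) rbindlist(lapply(files, fread)))
dt <- rbindlist(batch_tables)
```

To keep the work from a failed run, each group result could be written to disk (for example with `saveRDS`) before moving on to the next group.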

I think no matter what you do it is going to be a lot of work.

score 0 · Answer 7 · answered Jul 09 '15 at 12:24

Some things to consider under the assumption you can trust all the input data and that each record is sure to be unique:

Consider creating the table being imported into without indexes. As indexes get huge the time involved to manage them during imports grows -- so it sounds like this may be happening. If this is your issue it would still take a long time to create indexes later.
Alternately, with the amount of data you are discussing you may want to consider a method of partitioning the data (often done via date ranges). Depending on your database you may then have individually indexed partitions -- easing index efforts.
If your demonstration code doesn't resolve down to a database file import utility then use such a utility.
It may be worth processing files into larger data sets prior to importing them. You could experiment with this by combining 100 files into one larger file before loading, for example, and comparing times.

In the event you can't use partitions (depending on the environment and the experience of the database personnel) you can use a home-brewed method of separating data into various tables. For example data201401 to data201412. However, you'd have to roll your own utilities to query across boundaries.

While decidedly not a better option, it is something you could do in a pinch -- and it would allow you to retire/expire aged records easily and without having to adjust the related indexes. It would also let you load pre-processed incoming data by "partition" if desired.
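
As a rough illustration of the home-brewed partitioning described above, here is how rows could be grouped into per-month tables in R before loading them anywhere. The sample rows are invented, and the `partition` column name is an assumption; only the `data201401`-style naming comes from the text.

```r
library(data.table)

# Hypothetical sample rows with a timestamp column, as in the question's data.
dt <- data.table(
  Timestamp = as.POSIXct(c("2014-01-15 10:00:00", "2014-01-20 12:30:00",
                           "2014-02-03 08:15:00"), tz = "UTC"),
  Upvotes = c(3L, 1L, 7L)
)

# Derive a partition label such as "data201401" from the year and month.
dt[, partition := paste0("data", format(Timestamp, "%Y%m"))]

# One table per partition, named after its label.
parts <- split(dt, by = "partition")
```

Each element of `parts` could then be written to its own table or file, and queries that span months would need to combine the relevant elements.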
