I have about 1,500 csv files which I want to load into RStudio. My plan is to read the files one by one in a for loop and combine them with rbind(). I estimate the total at about 1.6 million rows. Then I want to load the completed data frame into a MySQL server. So is it possible to have 1.6 million rows of data in a data frame?
- Highlight your question, copy, and paste it into a search engine: [this](http://stackoverflow.com/questions/5233769/practical-limits-of-r-data-frame) and another [here](http://stackoverflow.com/questions/9984283/maximum-size-of-a-matrix-in-r). Unless you have a lot of spare time, don't use a for loop for that. – Pierre L Jun 23 '15 at 14:34
- I am currently using a data.frame with 1433056 rows. – kasterma Jun 23 '15 at 14:35
- Use `data.table` and `fread`. Something along the lines of `dt = rbindlist(lapply(list.files(".", ".csv"), fread, header = T))` to read in all the files (expanded into a runnable sketch after these comments). – Vlo Jun 23 '15 at 14:38
- You are limited primarily by the RAM on your machine. But you really do not want to `rbind` 1500 csv files in a for loop; that will be very slow. – joran Jun 23 '15 at 14:39
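A runnable version of the data.table approach from the comment above, as a sketch: it assumes the csv files sit in the working directory and share a common column layout (the path and pattern are placeholders).

```r
library(data.table)

# Collect the csv file paths (placeholder path and pattern).
files <- list.files(".", pattern = "\\.csv$", full.names = TRUE)

# fread() each file, then stack them all in a single call;
# rbindlist() avoids growing an object piecewise.
dt <- rbindlist(lapply(files, fread), use.names = TRUE)
```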
1 Answer
> My plan is to read the files one by one in a for loop and combine them with rbind().
This is a bad idea, because growing an object with iterative calls to `rbind` is very slow in R (see the second circle of *The R Inferno* for details). You will probably find it more efficient to read in all the files and combine them with a single call to `rbind`:

```r
file.list <- list.files(pattern = "\\.csv$")  # assumes the files sit in the working directory
do.call(rbind, lapply(file.list, read.csv))
```
> Is it possible to have 1.6 million rows of data in a data frame?
You can find out pretty easily:
```r
dat <- data.frame(X = rep(0, 1600000))
str(dat)
# 'data.frame': 1600000 obs. of  1 variable:
#  $ X: num  0 0 0 0 0 0 0 0 0 0 ...
```
Not only can you initialize a data frame with 1.6 million rows, but you can do it in under 0.1 seconds (on my machine).
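If you want to verify the timing on your own machine, `system.time` makes that easy (results will vary by hardware):

```r
# Time the initialization; 'elapsed' should be a small fraction of a second.
system.time(dat <- data.frame(X = rep(0, 1600000)))
```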

– josliber
- I wonder if it wouldn't be even more efficient to concatenate/append the files via a system call and read the result in all at once? I haven't tested it, but I suspect even the single call to `rbind()` would be interminably slow. And if the overall goal is to load the data into MySQL anyway, why use R as a go-between? I would just do a [`LOAD DATA INFILE`](http://dev.mysql.com/doc/refman/5.0/en/load-data.html) (sketched after these comments) and then off you go. – Forrest R. Stevens Jun 23 '15 at 15:21
- @ForrestR.Stevens A single call to `rbind` could be quite quick or quite slow depending on the number of columns. Agreed that R is only useful in this workflow if the OP wants to do some processing in between reading in the files and outputting to the SQL server. – josliber Jun 23 '15 at 15:24
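A sketch of the `LOAD DATA INFILE` route suggested above, driven from R. It assumes a Unix-like shell, headerless csv files (or headers stripped beforehand), and an existing MySQL table; the database name, table name, and credentials are placeholders.

```r
# Concatenate the csv files outside R with a single shell call.
system("cat *.csv > combined.csv")

# Bulk-load the combined file straight into MySQL. 'mydb', 'mytable',
# and the credentials are placeholders; LOCAL INFILE may need to be
# enabled on the server.
library(DBI)
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", user = "me", password = "secret")
dbExecute(con, "LOAD DATA LOCAL INFILE 'combined.csv'
                INTO TABLE mytable
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'")
dbDisconnect(con)
```

If some processing in R is wanted first, as the answer's author notes, the combined data frame could instead be pushed with something like `dbWriteTable(con, "mytable", dat)`.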