2

I have about 1,500 CSV files that I want to load into R (RStudio). I plan to rbind() each CSV file one by one (using a for loop). I estimate the total at roughly 1.6 million rows. Then I want to load the completed data frame into a MySQL server. So is it possible to have 1.6 million rows of data in a data frame?

jamryu
  • Highlight your question, copy, and paste it into a search engine: [this](http://stackoverflow.com/questions/5233769/practical-limits-of-r-data-frame) and another [here](http://stackoverflow.com/questions/9984283/maximum-size-of-a-matrix-in-r). Unless you have a lot of spare time, don't use a for loop for that. – Pierre L Jun 23 '15 at 14:34
  • I am currently using a data.frame with 1433056 rows. – kasterma Jun 23 '15 at 14:35
  • Use `data.table` and `fread`. Something along the lines of `dt = rbindlist(lapply(list.files(".", ".csv"), fread, header = T))` to read in all the files – Vlo Jun 23 '15 at 14:38
  • You are limited primarily only by the RAM on your machine. But you really are not going to want to `rbind` 1500 csv files in a for loop. That will be very slow. – joran Jun 23 '15 at 14:39
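
A minimal sketch of the `data.table` route from Vlo's comment above (a sketch only, assuming all 1,500 files sit in one directory and share the same column layout; the directory name is a placeholder):

library(data.table)

# list every .csv in the directory and stack them into a single table
files <- list.files("csv_dir", pattern = "\\.csv$", full.names = TRUE)
dt <- rbindlist(lapply(files, fread))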

1 Answer

5

> I plan to rbind() each CSV file one by one (using a for loop).

This is a bad idea, because growing an object with repeated calls to `rbind` is very slow in R (see the second circle of The R Inferno for details). You will probably find it more efficient to read in all the files first and then combine them in a single call to `rbind`:

do.call(rbind, lapply(file.list, read.csv))
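
Here `file.list` is a character vector of file paths; a minimal way to build it (the directory name is a placeholder) would be:

file.list <- list.files("csv_dir", pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(file.list, read.csv))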

> Is it possible to have 1.6 million rows of data in a data frame?

You can find out pretty easily:

dat <- data.frame(X=rep(0, 1600000))
str(dat)
# 'data.frame': 1600000 obs. of  1 variable:
#  $ X: num  0 0 0 0 0 0 0 0 0 0 ...

Not only can you initialize a data frame with 1.6 million rows, but you can do it in under 0.1 seconds (on my machine).
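
Since the end goal is a MySQL server, here is a hedged sketch of pushing the combined data frame (`combined` from above) to the database with DBI and RMySQL; the connection details and table name are placeholders, not part of the original answer:

library(DBI)

# open a connection to the MySQL server (credentials are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")

# write the combined data frame to a table on the server, then disconnect
dbWriteTable(con, "my_table", combined, row.names = FALSE)
dbDisconnect(con)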

josliber
  • I wonder if it wouldn't be even more efficient just to concatenate/append the files via a system call and read it in all at once? I haven't tested it but I suspect even the single call to `rbind()` would be interminably slow? And if the overall goal is to load it into MySQL anyway why use R as a go-between? I would just do a [`LOAD DATA INFILE`](http://dev.mysql.com/doc/refman/5.0/en/load-data.html) and then off you go. – Forrest R. Stevens Jun 23 '15 at 15:21
  • @ForrestR.Stevens A single call to `rbind` could be quite quick or quite slow depending on the number of columns. Agreed that R is only useful in this workflow if OP wants to do some processing in between reading in the files and outputting to the SQL server. – josliber Jun 23 '15 at 15:24