22

I have two data frames, df1 and df2, that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them I get that most dreaded of R error messages: cannot allocate memory. There have got to be more efficient ways to do an rbind -- anyone have favorite tricks on this they want to share? For instance, I found this example in the doc for sqldf:

# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")

Is that the best/recommended way to do it?

UPDATE: I got it to work using the crucial `dbname = tempfile()` argument in the sqldf call above, as JD Long suggests in his answer to this question.
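
For reference, a minimal sketch of what the working call looks like (a5s and a6s are the tables from the example above):

library(sqldf)
# dbname = tempfile() makes sqldf stage the work in an on-disk SQLite
# database instead of building everything in memory
a7s <- sqldf("select * from a5s union all select * from a6s",
             dbname = tempfile())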

Prasad Chalasani
  • Good idea -- how do I do that -- you mean using something like `memory.limit(size = 4000)` ? – Prasad Chalasani Jan 21 '11 at 16:29
  • No. BTW, that works only in Windows. See my answer below. – aL3xa Jan 21 '11 at 16:53
  • 1
    You could always save them using `write.csv` and combine the data using the shell. See http://stackoverflow.com/questions/2477271/concatenate-text-files-with-windows-command-line-dropping-leading-lines – James Jan 21 '11 at 17:47
  • 1
    Creating `sqldf`-based versions of functions like `rbind` sounds like a fun homework challenge for someone. – Richie Cotton Jan 21 '11 at 18:36

4 Answers

29

Rather than reading them into R at the beginning and then combining them, you could have SQLite read and combine them before sending the result to R. That way the files are never individually loaded into R.

# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)

DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)

# now we do the real work
library(sqldf)

data1 <- file("data1.dat")
data2 <- file("data2.dat")

sqldf(c("select * from data1", 
 "insert into data1 select * from data2", 
 "select * from data1"), 
 dbname = tempfile())

This gives:

>  sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
   A  B
1  1  2
2  2  3
3 10 12
4 11 13

This shorter version also works if row order is unimportant (note that `union`, unlike `union all`, also removes duplicate rows):

sqldf("select * from data1 union select * from data2", dbname = tempfile())

See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close to but not identical to those of read.table. Here we have used the defaults, so it was less of an issue.
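
For example, if your files used a different separator, something along these lines should work (a sketch, not tested; see ?sqldf for the exact components the file.format list accepts):

# hypothetical: semicolon-separated files with a header row
sqldf("select * from data1 union all select * from data2",
      dbname = tempfile(),
      file.format = list(header = TRUE, sep = ";"))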

G. Grothendieck
  • Neat approach... SQL is definitely capable of chewing something that big! – aL3xa Jan 21 '11 at 17:46
  • Very useful, thanks @Gabor... I have the data in an actual SQL database, and reading the entire thing with one query was hosing my memory, which is why I had to read each half into R first using `RODBC/sqlQuery` (don't ask me why it choked on the whole thing but did not choke when I read each half). But I agree if my original data is in two flat files, yours is the best way to read them in, and avoid having the two parts in R memory. – Prasad Chalasani Jan 21 '11 at 18:46
22

Consider the data.table R package for efficient operations on objects with several million records.

Version 1.8.2 of that package offers the rbindlist function, through which you can achieve what you want very efficiently. Thus instead of rbind(a5r, a6r) you can do:

library(data.table)
rbindlist(list(a5r, a6r))
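
A self-contained sketch with small stand-ins for a5r and a6r, just to show the call:

library(data.table)

# small stand-ins for the question's two data frames
a5r <- data.frame(A = 1:3, B = 4:6)
a6r <- data.frame(A = 7:9, B = 10:12)

# rbindlist builds the combined result in a single pass over a list of
# data.frames/data.tables, which is much cheaper than rbind for large inputs
a7r <- rbindlist(list(a5r, a6r))

Note that rbindlist returns a data.table; wrap the result in as.data.frame() if you need a plain data.frame.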
nikola
1

For completeness in this thread on the topic of unioning large files: try using shell commands on the files to combine them. On Windows that is the COPY command with the /B flag. Example:

# builds and runs:
# cmd.exe /c COPY /Y "file_1.csv" /B + "file_2.csv" /B "resulting_file.csv" /B
system(command =
         paste0(
           c("cmd.exe /c COPY /Y"
             , '"file_1.csv" /B'
             , '+ "file_2.csv" /B'
             , '"resulting_file.csv" /B'
           ), collapse = " "
         )
)#system

This requires that the files have no header, use the same delimiter, and so on. The speed and versatility of shell commands is sometimes a great benefit, so don't forget CLI commands when mapping out data flows.
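
If the pieces are still in R, one way to produce header-less files that concatenate cleanly is something like this (a sketch, using the question's df1 and df2 and the file names above):

# write both pieces without headers or row names so that byte-level
# concatenation yields one valid CSV body
write.table(df1, "file_1.csv", sep = ",", col.names = FALSE,
            row.names = FALSE, quote = FALSE)
write.table(df2, "file_2.csv", sep = ",", col.names = FALSE,
            row.names = FALSE, quote = FALSE)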

Dan Gustafsson
1

Try creating a data.frame of the desired size up front, then filling in your data using subscripts.

dtf <- as.data.frame(matrix(NA, 10, 10))
dtf1 <- as.data.frame(matrix(1:50, 5, 10, byrow=TRUE))
dtf2 <- as.data.frame(matrix(51:100, 5, 10, byrow=TRUE))
dtf[1:5, ] <- dtf1
dtf[6:10, ] <- dtf2

I guess that rbind grows the object without pre-allocating its dimensions... I'm not positively sure, this is only a guess. I'll comb through "The R Inferno" or "Data Manipulation with R" tonight. Maybe merge will do the trick...

EDIT

And you should bear in mind that (maybe) your system and/or R cannot cope with something that big. Try RevolutionR, maybe you'll manage to spare some time/resources.

aL3xa