
I need to rbind two large data frames. Right now I use

df <- rbind(df, df.extension)

but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.

So my question is: Is there a way to avoid data duplication in memory when using rbind?

I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.

Sebastian
  • @Dwin: are you paying? If so, can you buy some for me too? ;-) – Joshua Ulrich Aug 17 '11 at 14:25
  • If I were working for myself it would pay for itself in increased productivity, and when I have posed that argument to my current employer it was accepted as a "business case". – IRTFM Aug 17 '11 at 14:41
  • @DWin: Two issues: 1: I've learned that (re-)coding time requires a TARDIS. 2: Beyond a particular sweet spot, it is better to memory map than to get more RAM. Often, one's objective function for HPC is multidimensional. – Iterator Aug 17 '11 at 15:02
  • Tell us the dimensions of both dfs. Seems like `object.size(df) >> object.size(df.extension)`, right? Also, can we safely assume both their columns are identical in number, name, type, factor levels? so we don't need to check, fill missing columns, NAs etc? – smci Sep 03 '18 at 02:08
  • `dplyr::bind_rows()` seems to be working much better than `rbind()`. Have you tried it? – Karthik Thrikkadeeri Feb 20 '22 at 18:41
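
A minimal sketch of the `bind_rows()` suggestion from the last comment, assuming both data frames have compatible columns and reusing the question's object names:

library(dplyr)

# bind_rows() stacks data frames and is typically faster and lighter
# on memory than repeated rbind() calls
df <- bind_rows(df, df.extension)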

4 Answers


data.table is your friend!

Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html


Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2):

Same as do.call("rbind",l), but much faster.
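
As a rough sketch of how rbindlist might be applied to the question's objects, assuming both data frames share identical column names and types:

library(data.table)

# rbindlist() takes a list of data.frames/data.tables and stacks them,
# avoiding much of the copying done by repeated rbind() calls
df <- rbindlist(list(df, df.extension))
# the result is a data.table; use setDF(df) if a plain data.frame is needed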

Ari B. Friedman
  • Plus, version 1.8.2 of `data.table` has the `rbindlist` function, which will be helpful there. – nikola Aug 27 '12 at 19:36
  • Note that rbindlist doesn't check column names, which is part of the reason it's faster. `dplyr`'s rbind_all is slightly slower, but does do column name checking, so sometimes it can be more useful. – naught101 Sep 17 '14 at 05:35

First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.

One method (not advisable) that saves quite a bit of memory is to treat your data frames as lists: build the combined columns with a for-loop (apply will eat memory like hell) and then make R believe the list actually is a data frame.

I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So test thoroughly, and avoid this approach if you can.

You could try the following approach:

n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

# build the combined object column by column as a plain list
dtf <- list()

for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}

# then make R treat the list as a data frame
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"

It erases any row names you had (you can reconstruct them, but check for duplicate row names!), and it skips all the other checks that rbind carries out.

It saves you about half of the memory in my tests, and in those tests dtfcomb and dtf are equal. In the plot below, the red box is rbind and the yellow one is my list-based approach.

[Plot: memory use over time; red = rbind, yellow = list-based approach]

Test script:

n1 <- 3000000
n2 <- 3000000
ncols <- 20

dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

# the gc()/Sys.sleep() pairs separate the phases when watching memory use externally
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for(i in names(dtf1)){
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1+n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()
Joris Meys
  • While "not advisable" it looks like fun. However, your plot lacks axes and scales. ;-) – Iterator Aug 17 '11 at 15:09
  • +1 for the memory measurement. Needs more work to handle factors (and other columns with attributes) since c(a,b) removes all attributes. – Tommy Aug 17 '11 at 15:48
  • @Tommy: definitely sure. Hence my warnings. I didn't mention it specifically of course, but I didn't have time to build in all the controls. To say it with my favorite phrase: "I leave that to the reader as an exercise" ;) – Joris Meys Aug 18 '11 at 08:55

For now I have worked out the following solution:

nextrow = nrow(df)+1
df[nextrow:(nextrow+nrow(df.extension)-1),] = df.extension
# we need to ensure unique row names
row.names(df) = 1:nrow(df)

Now I don't run out of memory. I think it's because I store

object.size(df) + 2 * object.size(df.extension)

while with rbind R would need

object.size(rbind(df,df.extension)) + object.size(df) + object.size(df.extension)

After that I use

rm(df.extension)
gc(reset=TRUE)

to free the memory I don't need anymore.

This solves my problem for now, but I feel there is a more advanced way to do a memory-efficient rbind. I appreciate any comments on this solution.

Sebastian
  • That's as much 'in place' as you can make it. It uses about the same amount of memory as my solution, and has less chance of bugs. Very nice. Plus, why would you want something more complicated when this works without complications? The only thing is that you lose your original df, but if that's not a problem, yours is the best solution. – Joris Meys Aug 18 '11 at 09:02
  • @Joris: thanks. I'm aware that I lose the original df, but that's a compromise I have to take. Thumbs up for your memory performance analysis. – Sebastian Aug 23 '11 at 08:44
  • Seems like `object.size(df) >> object.size(df.extension)`, right? – smci Sep 03 '18 at 02:09
  • The more advanced way to genuinely do in-place `rbind` is [`data.table::rbindlist`, per Ari Friedman's answer](https://stackoverflow.com/questions/7093984/memory-efficient-alternative-to-rbind-in-place-rbind/12018716#12018716) – smci Sep 03 '18 at 02:12

This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:

  1. It's OK to use the HD: Memory mapping to the HD is much faster than practically any other access, so you may not see any slowdowns. At times I rely upon > 1TB of memory-mapped matrices, though most are between 6 and 50GB. Moreover, as the object is a matrix, this requires no real overhead of rewriting code in order to use the object.
  2. Whether you use a file-backed matrix or not, you can use separated = TRUE to make the columns separate. I haven't used this much, because of my 3rd tip:
  3. You can over-allocate the HD space to allow for a larger potential matrix size, but only load the submatrix of interest. This way there is no need to do rbind.

Note: Although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for different types of data and then combine the objects in RAM to create a dataframe, if it's really necessary.
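
A minimal sketch of the over-allocation idea in point 3, assuming purely numeric data; the file names and sizes here are only illustrative:

library(bigmemory)

ncols   <- 20
n_alloc <- 10e6   # over-allocate: room for more rows than currently needed

# file-backed matrix: lives on disk, only the touched parts come into RAM
big <- filebacked.big.matrix(nrow = n_alloc, ncol = ncols, type = "double",
                             backingfile = "big.bin", descriptorfile = "big.desc")

# write the first chunk into rows 1..n1
n1 <- 1e6
big[1:n1, ] <- matrix(rnorm(n1 * ncols), n1, ncols)

# "append" a second chunk by writing into the next free rows -- no rbind needed
n2 <- 1e6
big[(n1 + 1):(n1 + n2), ] <- matrix(rnorm(n2 * ncols), n2, ncols)

# pull only the submatrix of interest into RAM
first_rows <- big[1:1000, ]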

Iterator
  • errr... we're talking dataframes here, and far from every dataframe is transformable to a matrix. Think a dataframe with an integer and a factor for example... – Joris Meys Aug 17 '11 at 15:02
  • @Joris: We were thinking the same thought at the same time, sir. :) See my update. – Iterator Aug 17 '11 at 15:04
  • how would you deal with factors then? Plus, you lose all other functionality of dataframes. – Joris Meys Aug 17 '11 at 15:07
  • Just store the factor levels separately and convert to and from, using integers. I haven't sent factors to `bigmatrix`, as I generally handle stratification on my own. I have had factors mutilated too often by other R code to really use them, anyway. For such data, I almost always stick with integers and use variable names to indicate the type. If something handles factors responsibly, I do a conversion prior to passing the data. – Iterator Aug 17 '11 at 15:13
  • (Continued) The one exception being factors that are character strings. Mutilation of these can be easier to detect. – Iterator Aug 17 '11 at 15:14
  • Sounds fair. Still, I'd like to see some testing of the conversion to dataframe. If you're thinking of using data.frame or as.data.frame, be prepared for memory-hell :) – Joris Meys Aug 17 '11 at 15:17
  • Regarding the "other functionality" (not being sarcastic, just reproducing @Joris' mention) of data frames, I may have become biased: I have moved so much data to bigmemory that I don't think I use data frames that much any more. (I am intrigued by data.tables though.) So, Joris asks a fair question, but lacking a "big.data.frame" package, I "compromised" and moved to matrices in order to work with large amounts of data. – Iterator Aug 17 '11 at 15:17
  • We think the same thing. I agree. I simply do not use data.frames that much anymore. I know I make a special conversion when passing them to `ggplot`, but most of my data objects are small lists and large matrices. – Iterator Aug 17 '11 at 15:19
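
A tiny sketch of the factor handling Iterator describes above (keep only integer codes in the matrix and store the level table separately):

f <- factor(c("a", "b", "a", "c"))

codes <- as.integer(f)   # integer codes go into the big.matrix
lvls  <- levels(f)       # level table stays in RAM

# reconstruct the factor when needed
f2 <- factor(lvls[codes], levels = lvls)
identical(f, f2)   # TRUE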