Growing a data.frame in a memory-efficient manner

Question

According to Creating an R dataframe row-by-row, it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.

Edited to make clear what I'm pretty sure you meant. Please revert if I messed up. — Ari B. Friedman, Jul 14 '12 at 18:50
If you are still interested, [here is another benchmark of other set of different way to grow data.frame](http://stackoverflow.com/questions/20689650/how-to-append-rows-to-an-r-data-frame/38052208#38052208) when you don't know the size in advance. — Adam Ryczkowski, Jun 27 '16 at 17:38

Ari B. Friedman · Accepted Answer · 2012-07-18T11:04:28.147

First approach

I tried accessing each element of a pre-allocated data.frame:

res <- data.frame(x=rep(NA,1000), y=rep(NA,1000))
tracemem(res)
for(i in 1:1000) {
  res[i,"x"] <- runif(1)
  res[i,"y"] <- rnorm(1)
}

But tracemem goes crazy (e.g. the data.frame is being copied to a new address each time).

Alternative approach (doesn't work either)

One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))
res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames
library(taRifx)
res.df <- stack(res)

Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res)
[1] "<0x79b98b0>"
> res[[2]] <- data.frame()
tracemem[0x79b98b0 -> 0x71da500]:

In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.

Probably the best approach

As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign by reference operator, it can update without re-copying:

library(data.table)
dt <- data.table(x=rep(0,1000), y=rep(0,1000))
tracemem(dt)
for(i in 1:1000) {
  dt[i,x := runif(1)]
  dt[i,y := rnorm(1)]
}
# note no message from tracemem

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table)
n <- 10^6
dt <- data.table(x=rep(0,n), y=rep(0,n))

dt.colon <- function(dt) {
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) {
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2)

(Results shown below)

Benchmarking

With the loop run 10,000 times, data table is almost a full order of magnitude faster:

Unit: seconds
          expr        min         lq     median         uq        max
1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

benchmarks

And comparison of := with set():

> m
Unit: milliseconds
          expr       min        lq    median       uq      max
1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6 not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds not seconds. Impressive indeed.

As far as I can tell, your last example doesn't grow the data.table. You simply overwrite the first row 1,000 times. — Andrie, Jul 14 '12 at 19:25
That's good but have you seen the speed example at the bottom of `?":="` comparing `:=` within a loop to `set()` within a loop. `:=` has overhead (e.g. checking the existence and type of arguments passed to `[.data.table`), which is why `set()` is provided for use inside loops. — Matt Dowle, Jul 16 '12 at 10:55
And overhead would explain why it wasn't the fastest on small data. Change to `set()` and it should be the fastest again. — Matt Dowle, Jul 16 '12 at 10:57
@MatthewDowle Neat tip, thanks. I couldn't find anything about `set()` documented in `?":="`, though, and even `?set` has only a comment that it "should be documented in ?":=", perhaps." — Ari B. Friedman, Jul 16 '12 at 11:49
Ah yes, `set` is now (recently) better documented and lives in `?":="`. It's thanks to discussions with @JoshuaUlrich here on S.O. that `set()` got added to `data.table`. Search NEWS for string "set(" for further info. — Matt Dowle, Jul 16 '12 at 12:31
@Ari v1.8.2 is now on CRAN (last night) passing checks and working its way through the CRAN system. — Matt Dowle, Jul 18 '12 at 10:00

score 9 · Answer 2 · answered Jul 14 '12 at 20:09

9

You could also have an empty list object where elements are filled with dataframes; then collect the results at the end with sapply or similar. An example can be found here. This will not incur the penalties of growing an object.

answered Jul 14 '12 at 20:09

Marc in the box

11,769
4
47
97

score 7 · Answer 3 · edited May 23 '17 at 12:25

Well, I am very surprised that nobody mentioned the conversion to a matrix yet...

Comparing with the dt.colon and dt.set functions defined by Ari B. Friedman , the conversion to a matrix has the best running time (slightly quicker than dt.colon). All affectations inside a matrix are done by reference, so there is no unnecessary memory copy performed in this code.

CODE:

library(data.table)
n <- 10^4
dt <- data.table(x=rep(0,n), y=rep(0,n))

use.matrix <- function(dt) {
  mat = as.matrix(dt)  # converting to matrix
  for(i in 1:n) {
    mat[i,1] = runif(1)
    mat[i,2] = rnorm(1)
  }
  return(as.data.frame(mat))  # converting back to a data.frame
}


dt.colon <- function(dt) { # same as Ari's function
  for(i in 1:n) {
    dt[i,x := runif(1)]
    dt[i,y := rnorm(1)]
  }
}

dt.set <- function(dt) { # same as Ari's function
  for(i in 1:n) {
    set(dt,i,1L, runif(1) )
    set(dt,i,2L, rnorm(1) )
  }
}

library(microbenchmark)
microbenchmark(dt.colon(dt), dt.set(dt), use.matrix(dt),times=10)

RESULT:

Unit: milliseconds
           expr        min         lq     median         uq        max neval
   dt.colon(dt) 7107.68494 7193.54792 7262.76720 7277.24841 7472.41726    10
     dt.set(dt)   93.25954   94.10291   95.07181   97.09725   99.18583    10
 use.matrix(dt)   48.15595   51.71100   52.39375   54.59252   55.04192    10

Pros of using a matrix:

this is the fastest method so far
you don't have to learn/use data.table objects

Con of using a matrix:

you can only handle one data type in a matrix (in particular, if you had mixed types in the columns of your data.frame, then they will all be converted to character by the line: mat = as.matrix(dt) # converting to matrix)

score 6 · Answer 4 · edited Jan 21 '15 at 05:06

I like RSQLite for that matter: dbWriteTable(...,append=TRUE) statements while collecting, and dbReadTable statement at the end.

If the data is small enough, one can use the ":memory:" file, if it is big, the hard disk.

Of course, it can not compete in terms of speed:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1))

library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

collect1 <- function(n) {
  for (i in 1:n) dbWriteTable(con, "test", makeRow(), append=TRUE)
  dbReadTable(con, "test", row.names=NULL)
}

collect2 <- function(n) {
  res <- data.frame(x=rep(NA, n), y=rep(NA, n))
  for(i in 1:n) res[i,] <- makeRow()[1,]
  res
}

> system.time(collect1(1000))
   User      System verstrichen 
   7.01        0.00        7.05  
> system.time(collect2(1000))
   User      System verstrichen 
   0.80        0.01        0.81

But it might look better if the data.frames have more than one row. And you do not need to know the number of rows in advance.

The idea is cool, but it [is far from efficient](http://stackoverflow.com/questions/20689650/how-to-append-rows-to-an-r-data-frame/38052208#38052208). I put it on a test on another thread. — Adam Ryczkowski, Jun 27 '16 at 17:36

score 3 · Answer 5 · answered Nov 26 '18 at 14:54

This post suggests stripping off data.frame / tibble's class attributes using as.list, assigning list elements in-place the usual way and then converting the result back to data.frame / tibble again. The computational complexity of this method grows linearly but with a very little rate of less than 10e-6.

in_place_list_bm <- function(n) {
    res <- tibble(x = rep(NA_real_, n))
    tracemem(res)
    res <- as.list(res)
    for (i in 1:n) {
        res[['x']][[i]] <- i
    }
    return(res %>% as_tibble())
}

> system.time(in_place_list_bm(10000))[[3]]
tracemem[0xd87aa08 -> 0xd87aaf8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xd87aaf8 -> 0xd87abb8]: in_place_list_bm system.time 
tracemem[0xd87abb8 -> 0xe045928]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe045928 -> 0xe043488]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0xe043488 -> 0xe043728]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.006
> system.time(in_place_list_bm(100000))[[3]]
tracemem[0xdf89f78 -> 0xdf891b8]: as.list.data.frame as.list in_place_list_bm system.time 
tracemem[0xdf891b8 -> 0xdf89278]: in_place_list_bm system.time 
tracemem[0xdf89278 -> 0x5e00fb8]: check_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5e00fb8 -> 0x5dd46b8]: new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
tracemem[0x5dd46b8 -> 0x5dcec98]: set_tibble_class new_tibble list_to_tibble as_tibble.list as_tibble <Anonymous> withVisible freduce _fseq eval eval withVisible %>% in_place_list_bm system.time 
[1] 0.045

Here is an image from the original article:

Growing a data.frame in a memory-efficient manner

5 Answers5

Linked