Decreasing memory consumption in R -- pass by reference / data.table

Question

I've already achieved a substantial speed up (~6.5x) by moving subsetting operations from base data.frame operations to data.table operations. But I'm wondering if I can get any improvement in memory.

My understanding is that R does not natively pass-by-reference (eg. see here). So, I'm seeking a method (short of re-writing a complex function in Rcpp) to do so. data.table provides some improvement [after editing my question to include typo caught by @joshua ulrich below]. But I'm looking for a larger improvement if possible.

Another option is possibly the R.oo package, though I haven't yet found a good tutorial. (I still need to read this.
Would reference classes help at all?

In my actual use case, I'm doing simulation in parallel of numerous datasets with optimization via simulated annealing. I'd rather not re-write both simulated annealing and my loss function calculations in Rcpp due to the increased dev time and increased technical debt.

Example of problem:

What I'm largely concerned with is removing some subset of observations from a dataset and adding in another subset of observations. A very simple (nonsensical) example is given here. Is there a way to decrease memory usage? My current usage appears to pass-by-value and therefore memory usage (RAM) is roughly doubled.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace=F)
  s2 <- sample(1:nrow(df), size= 500, replace=F)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:500) {

  Rprof("./DF_mem.out", memory.profiling = TRUE)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.self$mem.total[which(rownames(df$by.self) == "\"rbind\"")]


  Rprof("./DT_mem.out", memory.profiling = TRUE)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.self$mem.total[which(rownames(dt$by.self) == "\"rbind\"")]

}
pryr::object_size(df1)
80 MB
pryr::object_size(df2)
80 MB

# EDITED: via typo / fix from @Joshua Ulrich.
# improvement in memory usage via DT. still not pass-by-reference
quantile(df_m, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
379.00 428.60 440.10 447.70 455.36 459.20 466.48 469.89 474.40 482.10 512.60 
quantile(dt_m, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
 76.80  84.50  84.50  92.10  92.10  92.10  92.10 107.30 116.46 130.20 157.00

Appendix:

### speed improvement:
#-----------------------------------------------
library(data.table)
library(microbenchmark)

set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

microbenchmark(
  df= {
    s1 <- sample(1:nrow(df1), size= 500, replace=F)
    s2 <- sample(1:nrow(df1), size= 500, replace=F)
    df1 <- rbind(df1[-s1,], df1[s2,])
  },
  dt= {
    s1 <- sample(1:nrow(df2), size= 500, replace=F)
    s2 <- sample(1:nrow(df2), size= 500, replace=F)
    df2 <- rbind(df2[-s1,], df2[s2,])

  }, times= 100L)

Unit: milliseconds
 expr      min        lq     mean   median       uq      max neval cld
   df 672.5106 757.65188 814.1582 809.6346 864.6668 998.2290   100   b
   dt  68.1254  85.73178 139.1256 120.3613 148.8243 397.7359   100  a

I don't think there's much point to including the sample calls inside your benchmark, since they're shared. For the `rbind` thing, I guess you might save speed by rearranging the vector of row numbers and subsetting once rather than using `rbind`. `myrows = df2[, c(.I[-s1], .I[s2])]; df2[myrows]` or similar. By the way, regarding pass by reference, this may be of interest to you: http://stackoverflow.com/q/15759117/1191259 Finally, if your true application uses `sample` like this, you should consider switching to `sample.int` or at least writing `n` instead of `1:n` as the first arg. — Frank, Apr 01 '16 at 00:11
@Symbolix `rbindlist` is called by method dispatch automatically... rerunning `Rprof` now w/o the typo that Joshua Ulrich identified below — alexwhitworth, Apr 01 '16 at 00:28
Perhaps also see [this](http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table) for adding/removing rows (more-or-less) by reference. a bit convoluted though. — MichaelChirico, Apr 01 '16 at 23:36

score 5 · Accepted Answer · answered Apr 01 '16 at 00:08

prof_func has an error. It calls rbind on df1 instead of it's argument (df). Fix that, and you will see reduced memory usage with the data.table object.

library(data.table)
set.seed(444L)

df1 <- data.frame(matrix(rnorm(1e7), ncol= 10))
df2 <- data.table(matrix(rnorm(1e7), ncol= 10))

prof_func <- function(df) {
  s1 <- sample(1:nrow(df), size= 500, replace=F)
  s2 <- sample(1:nrow(df), size= 500, replace=F)
  return(rbind(df[-s1,], df[s2,]))
}

dt_m <- df_m <- vector("numeric", length= 500L)

for (i in 1:100) {
  Rprof("./DF_mem.out", memory.profiling = TRUE, interval=0.01)
  y <- prof_func(df1)
  Rprof(NULL)
  df <- summaryRprof("./DF_mem.out", memory= "both")
  df_m[i] <- df$by.total["\"rbind\"","mem.total"]

  Rprof("./DT_mem.out", memory.profiling = TRUE, interval=0.01)
  y2 <- prof_func(df2)
  Rprof(NULL)
  dt <- summaryRprof("./DT_mem.out", memory = "both")
  dt_m[i] <- dt$by.total["\"rbind\"","mem.total"]
}
quantile(df_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0 413.4 432.5 455.0 485.9 
quantile(dt_m, seq(0,1,.1))
#    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
#   0.0   0.0   0.0   0.0   0.0   0.0   0.0  53.9  84.5 122.6 153.1

Well, that's a mightily annoying typo... thanks. Though I don't have the typo in my real problem, so still looking for pass-by-reference if possible — alexwhitworth, Apr 01 '16 at 00:33
@Alex: then you need to rework your example to better illustrate your actual problem, since your example currently does what you're asking for. — Joshua Ulrich, Apr 01 '16 at 00:36
the memory overhead comes in computation, not post garbage collection... if you have any thought's I'd be glad to hear them in the comments. — alexwhitworth, Apr 01 '16 at 00:46
Yes, I agree-- trying to think through that now... I'm going to accept your answer to this and see if I can come up with a better example in a separate question. — alexwhitworth, Apr 01 '16 at 00:47

Decreasing memory consumption in R -- pass by reference / data.table

Example of problem:

Appendix:

1 Answers1