
Short of working on a machine with more RAM, how can I work with large lists in R, for example by putting them on disk and then working on sections of them?

Here's some code to generate the type of lists I'm using:

n <- 50; i <- 100
WORD <- vector(mode = "character", length = n)   # n random 5-character "words"
for (j in 1:n) {
  WORD[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}
dat <- data.frame(WORD = WORD,
                  COUNTS = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)   # i copies of the same data frame

In my actual use case each data frame in the list is unique, unlike the quick example here. I'm aiming for n = 4000 and i = 100,000.

This is one example of what I want to do with this list of data frames:

FUNC <- function(x) {rep(x$WORD, times = x$COUNTS)}
la <- lapply(dat_list, FUNC)

With my actual use case this runs for a few hours, fills up the RAM and most of the swap and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
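For a sense of scale, here's a back-of-the-envelope estimate using the toy `dat` above (rough only: the mean count of ~25 comes from the toy `sample(1:50, ...)`, and 4000 and 100,000 are the targets I mentioned). It suggests the expanded output alone is on the order of tens of gigabytes:

# rough estimate only: assumes a mean COUNTS of ~25 and the target sizes above
one_result <- rep(dat$WORD, times = dat$COUNTS)            # toy-sized output
bytes_per_element <- as.numeric(object.size(one_result)) / length(one_result)
elements_per_df <- 4000 * 25                               # approx. rows after expansion, full size
total_gb <- bytes_per_element * elements_per_df * 100000 / 1024^3
total_gb                                                   # tens of GB, far beyond typical desktop RAM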

I see that bigmemory is limited to matrices and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method is possible here, how might I get started? I can't get enough out of the documentation to make any progress and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (i.e. undergrad computer labs).
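To make the "put them on disk and work on sections" idea concrete, this is roughly the workflow I have in mind, sketched with nothing but base R (`saveRDS`/`readRDS`); the directory and file names are only placeholders:

# sketch only: one .rds file per data frame, so only one is ever held in RAM
dir.create("dat_store", showWarnings = FALSE)
for (k in seq_along(dat_list)) {
  saveRDS(dat_list[[k]], file.path("dat_store", sprintf("df_%06d.rds", k)))
}
rm(dat_list); invisible(gc())   # drop the in-memory copy

files <- list.files("dat_store", pattern = "\\.rds$", full.names = TRUE)
for (f in files) {
  x <- readRDS(f)
  res <- rep(x$WORD, times = x$COUNTS)
  # ... do something with res, or write it back to disk ...
  saveRDS(res, sub("\\.rds$", "_out.rds", f))
  rm(x, res)                    # keep only the current piece in memory
}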

UPDATE: Following up on the helpful comments from SimonO101 and Ari, here's some benchmarking comparing data frames and data.tables, loops and lapply, and runs with and without gc.

# self-contained speed test of untable
n <- 50; i <- 100
WORD <- vector(mode = "character", length = n)
for (j in 1:n) {
  WORD[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}
# as data table
library(data.table)
dat_dt <- data.table(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)

# as data frame
dat_df <- data.frame(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)

# increase object size by replicating each list y times
y <- 10
dt <- rep(dat_list_dt, y)
df <- rep(dat_list_df, y)
# untable
untable <- function(x) rep(x$WORD, times = x$COUNTS)


# preallocate objects for loop to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))
# functions for lapply
# (note: x %% 10 is TRUE except when x is a multiple of 10, so gc() runs on 9 of every 10 iterations)
df_untable_gc <- function(x) { untable(df[[x]]); if (x %% 10) invisible(gc()) }
dt_untable_gc <- function(x) { untable(dt[[x]]); if (x %% 10) invisible(gc()) }
# speedtests
library(microbenchmark)
microbenchmark(
  for(i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) },
  for(i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) },
  df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
  dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
  for(i in 1:length(df)) { df3[[i]] <- untable(df[[i]])},
  for(i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]])},
  df4 <- lapply(1:length(df), function(i) untable(df[[i]]) ),
  dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]) ),

  times = 10)

And here are the results: without explicit garbage collection, data.table is much faster and lapply is slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed, and a lot slower! I know that using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use case and see if it makes any difference. Of course I don't have any data on memory use for any of these functions, which is really my main concern. There seems to be no function for memory benchmarking equivalent to the timing functions (on Windows, anyway); the best I can come up with is the rough gc-based check sketched after the results below.

Unit: milliseconds
                                expr            min           lq       median           uq         max neval
 df1: for loop, data.frame, with gc   37436.433962 37955.714144 38663.120340 39142.350799 39651.88118    10
 dt1: for loop, data.table, with gc   37354.456809 38493.268121 38636.424561 38914.726388 39111.20439    10
 df2: lapply, data.frame, with gc     36959.630896 37924.878498 38314.428435 38636.894810 39537.31465    10
 dt2: lapply, data.table, with gc     36917.765453 37735.186358 38106.134494 38563.217919 38751.71627    10
 df3: for loop, data.frame               28.200943    29.221901    30.205502    31.616041    34.32218    10
 dt3: for loop, data.table               10.230519    10.418947    10.665668    12.194847    14.58611    10
 df4: lapply, data.frame                 26.058039    27.103217    27.560739    28.189448    30.62751    10
 dt4: lapply, data.table                  8.835168     8.904956     9.214692     9.485018    12.93788    10
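Here is the rough gc-based memory check mentioned above; it's only a crude sketch that reads base R's `gc()` "max used" counters, nothing like a proper memory profiler:

# crude peak-memory reading: reset gc()'s "max used" counters, run the
# operation, then read the counters again
invisible(gc(reset = TRUE))
tmp <- lapply(df, untable)   # the operation to measure
gc()                         # the "max used" / "(Mb)" columns show the peak since the reset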
Ben
  • I see you are using a `data.frame` construction, in which case `data.table` could be a massive help here because it changes values by reference, i.e. it doesn't make multiple copies of the data beforehand. As I understand it, aside from the space required in RAM to hold the entire table, the only additional working RAM space required is up to the length of the longest list element (e.g. `column`). Whereas with a `data.frame` you will need much more additional RAM when working on the object. Due to change-by-reference `data.table` will also be *far* quicker. – Simon O'Hanlon May 19 '13 at 09:12
  • Also, I'm not too sure of the intricacies of the garbage collector, but when using `lapply` like this, iterating across many data.frames, it's *possible* (but this is speculation) that garbage collection on already processed data.frames in memory may not happen until *after* the `lapply` loop has exited, which is why the memory fills up. Perhaps some other construct/looping mechanism would be appropriate here? – Simon O'Hanlon May 19 '13 at 09:16
  • @SimonO101 I agree. I've had a few big operations fail with `lapply` that worked with a simple rewrite to a `for` loop. – Ari B. Friedman May 19 '13 at 10:56
  • Thanks for the suggestions, I'll do some tests and report back – Ben May 19 '13 at 20:01
  • Small-scale benchmarking confirms `data.table` as faster than data.frame, and lapply as faster than an explicit loop. I'll test it out on my actual use-case and check in again (might be a while!). Thanks again for your helpful suggestions. – Ben May 20 '13 at 05:06
  • For the benefit of future searchers, I think a good answer to this question could be derived from my answer [here](http://stackoverflow.com/a/16683993/1036500) – Ben May 30 '13 at 02:17

1 Answer


If you really are going to be using very large data, you can use the h5r package to write HDF5 files. You would be writing to and reading from your hard drive on the fly instead of using RAM. I have not used this so I can be of little help on its general usage; I mention it because I think there is no tutorial for it. I got this idea by thinking about pytables. Not sure if this solution is appropriate for you.
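I haven't tested any of this myself, so treat the following as an untested sketch of the idea only. It uses the Bioconductor `rhdf5` package that Ben links in the comments below rather than `h5r`, and the file and group names are made up:

# untested sketch: write each data frame's columns to its own HDF5 group,
# then read them back and process one group at a time instead of keeping
# the whole list in RAM
library(rhdf5)

h5createFile("dat_list.h5")
for (k in seq_along(dat_list)) {
  grp <- sprintf("df%06d", k)
  h5createGroup("dat_list.h5", grp)
  h5write(as.character(dat_list[[k]]$WORD), "dat_list.h5", paste0(grp, "/WORD"))
  h5write(dat_list[[k]]$COUNTS, "dat_list.h5", paste0(grp, "/COUNTS"))
}

# later, one group at a time:
word <- h5read("dat_list.h5", "df000001/WORD")
counts <- h5read("dat_list.h5", "df000001/COUNTS")
out1 <- rep(word, times = counts)
H5close()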

JEquihua
  • Thanks for the suggestion. I'm just putting this here for me to come back to: http://bioconductor.org/packages/2.11/bioc/html/rhdf5.html – Ben May 20 '13 at 00:41
  • That is a different package. From what I have heard the h5r package is said to be more robust: http://cran.r-project.org/web/packages/h5r/index.html – JEquihua May 20 '13 at 00:47
  • Yes I know, just collecting options. Can you add a bit more detail to your answer about what you know of robustness? hdf5 is completely new to me! – Ben May 20 '13 at 00:57
  • And it looks like there's not a lot of the sort of entry-level documentation that I need to get started with the hdf5 packages... Well, perhaps that's a topic for another question... – Ben May 20 '13 at 04:43
  • I noticed this too, that's why I didn't really provide much information. Sorry! – JEquihua May 20 '13 at 05:27
  • Looking at the `h5r` [help doc](http://cran.r-project.org/web/packages/h5r/h5r.pdf) it seems that an hdf5 object must be 'a matrix or vector of the same data type'. So that limits its use in this case, where I want to store lists. Seems like the filehash package might be more relevant here. – Ben May 30 '13 at 02:29