
In my current project I have a calculation function that runs on one element of a vector A and returns a list element that I insert into list B. Each returned element contains a number of large, arbitrarily sized matrices that relate to the original input.

As an example, let's take a function that takes a number n and generates a random n x n matrix.

# input sizes
vector.A <- sample(1:2000, 15000, replace = TRUE)

# pre-allocate the results list with NA placeholders
list.B <- as.list(rep(NA, length(vector.A)))

# generate a random n x n matrix
arbitraryMatrix <- function(n) {
    matrix(rnorm(n*n), ncol = n, nrow = n)
}

# fill only the entries that are still NA, so the loop
# can resume after a failure without redoing earlier work
for (i in which(is.na(list.B))) {
    print(i)
    list.B[[i]] <- arbitraryMatrix(vector.A[i])
}

This loop slows down the larger list.B gets (in fact I'm pretty sure it will crash R before it finishes). It occurred to me that no element of list.B is ever accessed again after it's created, so it could be written to disk rather than taking up memory in a way that slows down the calculations.

I could write a script that does this by saving chunks into .rda files, but I was hoping someone had a more elegant solution.
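Something like the sketch below is what I have in mind, using base R's saveRDS to write one file per element (the "results" directory and the file-naming scheme are arbitrary):

# write each result straight to disk instead of keeping it in list.B
dir.create("results", showWarnings = FALSE)

for (i in seq_along(vector.A)) {
  f <- file.path("results", paste0("B_", i, ".rds"))
  if (file.exists(f)) next  # skip finished iterations so the loop can resume
  print(i)
  saveRDS(arbitraryMatrix(vector.A[i]), f)
}

# a later program can read any single element back on demand
m <- readRDS(file.path("results", "B_42.rds"))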

The ff package looked like an interesting possibility for this (http://cran.r-project.org/web/packages/ff/ff.pdf), but as far as I can tell it doesn't support list objects.

Caveats:

  • I'm using a for loop because I like to be able to repair bugs that arise on the 7000th iteration without having to rerun the first 6999 iterations unnecessarily.
  • Depending on your machine, adjust the parameters of the code until it runs, but only slowly, on your computer.
  • The actual problem I have takes a list as its input so I'm not interested in vectorising the arbitraryMatrix function.
  • The memory problem is compounded in my actual problem as the function uses a lot of memory (it involves subsetting data frames).

EDIT: I'm considering the mmap package, which maps R objects to temporary files, but I'm still trying to work out how to use it for this problem.

  • please don't put tags in titles when entirely _redundant_. – Grant Thomas May 30 '13 at 13:12
  • If no element of the list is accessed again, is there any need to put it in the list at all? – James May 30 '13 at 14:09
  • It will be accessed by a different R program later on but for this step of the process it just needs to be calculated and stored. – Jon M May 30 '13 at 14:10
  • 1
    What happens when you try [this](http://stackoverflow.com/questions/12577967/interactively-work-with-list-objects-that-take-up-massive-memory/16683577#16683577) method? – Ben May 30 '13 at 16:43
  • @Ben that seems really promising and looks close to what I was hoping for. I'll try it out and post an example if it works. – Jon M May 30 '13 at 19:32

1 Answer


Here's an answer using the filehash package. It's a good method because it has an impressively tiny memory footprint that barely increases as the function progresses. So that's fulfilled one of your objectives.

However, it's a bad method because it has two substantial drawbacks. (1) It is incredibly slow: if you open a process monitor you can see the disk and memory swapping going on at a rather leisurely rate (on my machine, at least). In fact it's so slow that I'm not sure whether it gets slower as it gets further along. I haven't run it to completion, only just past the point where I got an error when I ran the function in memory (about item 350 or so), to convince myself it was better than running in memory; at that point the disk object was already 73 GB. (2) And that's the second drawback: the disk object it creates is massive.

So here's hoping someone else comes along with a better answer to your question (perhaps with mmap?); I'll be most interested to see.

# set up disk storage object
library(filehash)
dbCreate("myTestDB")
db <- dbInit("myTestDB")

# put data on disk
db$A <- sample(1:2000, 15000, replace = TRUE)
db$B <- as.list(rep(NA, length(db$A)))

# function
arbitraryMatrix <- function(n) {
  matrix(rnorm(n*n), ncol = n, nrow = n)
}

# run function by accessing disk objects
for (i in which(is.na(db$B))) {
  print(i)
  db$B[[i]] <- arbitraryMatrix(db$A[i])
}

# run function by accessing disk objects, following
# Jon's comment to treat db as a list: store each result
# under its own top-level key instead of inside db$B
for (i in which(is.na(db$B))) {
  if (dbExists(db, as.character(i))) next  # skip items already stored, so the loop can resume
  print(i)
  db[[as.character(i)]] <- arbitraryMatrix(db$A[i])
}
# use db[[as.character(1)]] etc to access the list items
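And as a rough sketch of how a later R session might read the stored matrices back (assuming the one-key-per-item layout above; dbInit, dbList and dbFetch are standard filehash functions):

# re-open the existing database in a later session
library(filehash)
db <- dbInit("myTestDB")

# list the stored keys and fetch one matrix on demand
dbList(db)
m <- db[["42"]]  # equivalent to dbFetch(db, "42")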
  • While it may be a bit slow for the example I gave, it actually works excellently for the problem I was having in my work (the function takes around 4 seconds to run under the best conditions). But I agree it would be great to see an example using some of the other functions. – Jon M May 31 '13 at 10:47
  • Also, does the read/write work faster if you treat the db as a list, i.e. drop each element into db[[as.character(i)]]? At the moment I think your script may have to search within the list on the disk, which might slow it down, but I could be wrong. – Jon M May 31 '13 at 10:48
  • Yes, that's quite an interesting idea; I've added it to my answer. My quick benchmarking suggests it's not much faster (60 s vs 62 s for 150 items), but that might add up with a longer list. – Ben May 31 '13 at 22:45
  • That's good to know. I've been trying that approach out a bit and it seems very prone to corrupting the database when you add many entries. – Jon M Jun 01 '13 at 13:16
  • Yes, I think we're still some way from a satisfactory solution to storing lists on disk. It seems like the best option for now is storing each list item on disk as a matrix in a separate file. – Ben Jun 01 '13 at 23:56