I am wondering what the most memory-efficient way to initialize a list in R is when that list will be used in a loop to store results. I know that growing an object in a loop can seriously hurt computational efficiency, so I am trying to avoid that as much as possible.

My problem is as follows. I have several groups of data that I want to process individually. The gist of my code is a loop that runs through each group one at a time, does some t-tests, and then returns only the statistically significant results (so the results have a different length for each group). So far I am initializing a list of length(groups) to store the results of each iteration.

My main question is how I should be initializing the list so that the object is not grown in the loop.

  • Is it good enough to do list = vector(mode = "list", length=length(groups)) for the initialization?
    • I am skeptical about this because it just creates a list of length(groups) but each entry is equal to NULL. My concern is that during each iteration of the loop when I go to store data into the list, it is going to recopy the object each time as the entry goes from NULL to my results vector, in which case initializing the list doesn't really do much good. I don't know how the internals of a list work, however, so it is possible that it just stores the reference to the vector being stored in the list, meaning recopying is not necessary.
  • The other option would be to initialize each element of the list to a vector of the maximum possible length the results could have.
    • This is not a big issue as the maximum number of possible valid results is known. If I took this approach I would just overwrite each vector with the results vector within the loop. Since the maximum amount of memory would already be reserved hopefully no recopying/growth would occur. I don't want to take this approach, however, if it is not necessary and the first option above is good enough.

Below is some pseudocode describing my problem:

#initialize variables
results = vector(mode="list", length=length(groups)) #the line of code in question

#perform analysis on each group in groups
for(i in seq_along(groups))
{
  #returns a vector of p values with one entry per element in the group
  tTests = tTestFunction(groups[[i]])
  #keep only the statistically significant p values
  results[[i]] = tTests[tTests<=0.05]
}
Cole
  • I think an overwhelming majority would say that yes, `vector("list", length(groups))` is the way the result list should be initialized. The question's a bit broad. – Rich Scriven Jun 24 '16 at 18:41
  • It's easy enough to test both your ideas with a toy example. – joran Jun 24 '16 at 18:45

1 Answer

Your code does not run as posted, so it is not a good example. Consider this instead:

x <- vector("list", length = 4)
tracemem(x)  ## trace memory copies of "x"
for (i in 1:4) x[[i]] <- rnorm(4)

tracemem reports no extra copy of x during the updates, so there is nothing to worry about.

As suggested by @lmo, even if you initialize the list with x <- list(), tracemem reports no memory copy either.
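
Applied to the loop in the question, the same preallocation pattern might look like the sketch below. The data in `groups` and the body of `tTestFunction` are made up for illustration (the question does not show either), so treat this as one possible shape rather than the asker's actual code.

```r
set.seed(1)  # reproducible made-up data

# Hypothetical input: each group is a list of numeric samples to test
groups <- list(
  a = list(rnorm(20, mean = 0), rnorm(20, mean = 2), rnorm(20, mean = 0)),
  b = list(rnorm(20, mean = 3), rnorm(20, mean = 3)),
  c = list(rnorm(20, mean = 0))
)

# Hypothetical stand-in for tTestFunction(): one p value per element of a group
tTestFunction <- function(group) {
  vapply(group, function(s) t.test(s, mu = 0)$p.value, numeric(1))
}

# Preallocate one slot per group; every slot starts out as NULL
results <- vector(mode = "list", length = length(groups))

for (i in seq_along(groups)) {
  pValues <- tTestFunction(groups[[i]])
  # Keep only the significant p values; the length varies by group
  results[[i]] <- pValues[pValues <= 0.05]
}

lengths(results)  # number of significant results per group
```

Assigning to `results[[i]]` replaces whatever that slot held, so pre-filling each slot with a maximum-length vector should gain nothing: the placeholder would simply be discarded.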


Comment

The aim of my answer is to point you to tracemem, which you can use whenever you want to trace (possible) memory copies made during code execution. Had you known about this function, you would not have needed to ask here.

Another answer of mine also uses tracemem, although in a different context. There, you can see what tracemem prints when memory copies are made.
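
For reference, here is a minimal sketch of a case where tracemem does report a copy. It assumes R was built with memory-profiling support, which the code checks with `capabilities("profmem")`; the variable `y` is introduced only for this illustration.

```r
x <- vector("list", length = 4)

# tracemem() needs memory-profiling support, so guard the call
if (capabilities("profmem")) tracemem(x)

y <- x        # no copy yet: x and y refer to the same list
y[[1]] <- 1   # modifying y forces a copy of the list; tracemem reports it
              # with a line of the form: tracemem[0x... -> 0x...]:

if (capabilities("profmem")) untracemem(x)
```

The copy happens because two names point at the same list when one of them is modified. In the loop above only `x` refers to the list, which is why nothing is printed there.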

Zheyuan Li
  • The code doesn't work because it is pseudocode, as stated in my question. I included it to show the logical framework of what I was attempting to accomplish, not to give a working example (since this is just a small part of a complicated code chunk). And thank you for the answer, that `tracemem` seems pretty handy. – Cole Jun 24 '16 at 18:57
  • `tracemem()` is not reporting memory reallocation, I don't know why; use `.Internal(inspect(x))` to see that the outer vector is re-allocated on each assignment (the `@...` in the first line of output is the memory address of the list; somehow, it 'has' to be, because the original allocation wasn't big enough). And while individual elements aren't duplicated, they are copied into the list, with 1 copy for i = 2, 2 copies for i = 3, ... and about n (n - 1) / 2 copies (quadratic scaling) overall. – Martin Morgan Aug 01 '16 at 21:33
  • `inspect()` is a C-level function that summarizes information about each symbol; there is more information at http://stackoverflow.com/questions/18359940/r-programming-vector-a1-2-avoid-copying-the-whole-vector/18361181#18361181 and elsewhere in StackOverflow. – Martin Morgan Aug 01 '16 at 22:03
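
The growth cost described in the comments above can be measured with a rough sketch like the one below. The helper names `growList` and `fillList` are made up for this illustration. Timings depend on the machine and the R version, and newer versions of R may over-allocate when a vector grows, so the gap may be small.

```r
# Grow a list from empty, one element per iteration
growList <- function(n) {
  x <- list()
  for (i in seq_len(n)) x[[i]] <- i
  x
}

# Fill a list that was preallocated to its final length
fillList <- function(n) {
  x <- vector("list", n)
  for (i in seq_len(n)) x[[i]] <- i
  x
}

n <- 2e4
timeGrow <- system.time(grown <- growList(n))
timeFill <- system.time(filled <- fillList(n))

identical(grown, filled)  # both approaches build the same list

# Elapsed seconds for each approach
c(grow = timeGrow[["elapsed"]], fill = timeFill[["elapsed"]])
```

Both functions return the same list, so any difference is purely in how the storage is allocated along the way.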