7

What is the idiomatic way to collect results in a loop in R if the number of final results is not known beforehand? Here's a toy example:

results = vector('integer')
i=1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = c(results, i)
    i = i+1
}
results

The problem with this example is that (I assume) it will have quadratic complexity as the vector needs to be re-allocated at every append. (Is this correct?) I'm looking for a solution that avoids this.
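
To make the concern concrete, here is roughly how I would check it; the function name and the sizes are just illustrative, not part of the actual problem:

grow <- function(n) {
    out <- integer(0)
    for (i in seq_len(n)) out <- c(out, i)   # re-allocates the vector on every append
    out
}
system.time(grow(10000))
system.time(grow(20000))   # roughly 4x the time of grow(10000) if growth is quadratic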

I found Filter, but it requires pre-generating 1:bigBigBIGNumber which I want to avoid to conserve memory. (Question: does for (i in 1:N) also pre-generate 1:N and keep it in memory?)
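
For reference, the Filter version I'd like to avoid would look something like this (just a sketch; the point is that it materialises the whole sequence up front):

results <- Filter(someCondition, 1:bigBigBIGNumber)   # 1:bigBigBIGNumber is allocated in full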

I could make something like a linked list like this:

results = list()
i=1L
while (i < bigBigBIGNumber)  {
    if (someCondition(i)) results = list(results, i)
    i = i+1
}
unlist(results)

(Note that this is not concatenation. It's building a structure like list(list(list(1),2),3), then flattening with unlist.)
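
A quick demonstration of the flattening step, just to make the structure concrete:

unlist(list(list(list(1), 2), 3))
# [1] 1 2 3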

Is there a better way than this? What is the idiomatic approach that's typically used? (I am very new to R.) I'm looking for suggestions on how to tackle this type of problem. Suggestions for both compact (easy to write) and fast code are most welcome! (But I'd like to focus on fast and memory-efficient.)

Szabolcs
  • The `c` function is used to extend either vectors or lists. If you can estimate the size then allocating with `vector("integer", size)` will help reduce the cost of extending. – IRTFM May 12 '13 at 22:25
  • @DWin Are there existing tools which extend the array in a smart way, on demand? (E.g. double the size of the preallocated array once its capacity has been reached, and avoid quadratic complexity) – Szabolcs May 12 '13 at 22:28
  • @Szabolcs, why do you think replacing `c` with `list` would help here? Unless you preallocate a list, the same problem persists, isn't it? – Arun May 12 '13 at 22:36
  • @Arun Notice that `c` concatenates (which triggers a re-allocation of the vector) while with `list` I am building a structure like `list(list(list(1),2),3)`, a linked list. When put in a loop, the latter has linear complexity, the former has a quadratic one. You can easily verify this by a small benchmark: doubling the number of elements to be appended doubles the time with `list`, and nearly quadruples it with `c`. This means that for a "large enough" result, the `list` approach will always be faster. What I did not realize when I wrote the question is that in this case "large enough" ... – Szabolcs May 12 '13 at 22:41
  • @Arun ... is impractically large. For practical size lists `c` will still be faster. – Szabolcs May 12 '13 at 22:41
  • @Szabolcs, running your list-of-list-of-... on 1e5 elements already resulted in `Error: protect() protection stack overflow`. Not quite sure what is the range of your bigBigNumber.. – Arun May 12 '13 at 22:49
  • @Arun Yes, I just ran it too, and I realized that the complexity is not linear. I do not know why. I'd be interested in how to properly implement a linked list in R, if possible at all. – Szabolcs May 12 '13 at 22:50
  • Unfortunately, a list of list here is a nested list. It does not seem to fit the linked-list paradigm... – Arun May 12 '13 at 22:55
  • @Arun Can you explain why? The idea came from Mathematica (another functional system that prefers immutable structures), where the underlying storage mechanism of a nested list like this directly corresponds to a linked list. A Mathematica list is stored as an array of pointers to other Mathematica expressions, which can be lists themselves. So it seems it's different in R. Can you explain how lists are stored in R (I'm just curious, but this is not so relevant), and whether it is possible to implement a linked list at all (a bit more relevant)? – Szabolcs May 12 '13 at 22:58
  • For now, I hope [this answer](http://stackoverflow.com/a/2051159/559784) might partially help, in that R lists are like hash map data structures.. – Arun May 12 '13 at 23:01
  • Have you read the R inferno? http://www.burns-stat.com/pages/Tutor/R_inferno.pdf – Roman Luštrik May 12 '13 at 23:21
  • @RomanLuštrik I looked at it. It's one of the documents I look at from time to time while learning. It's nice. Thanks for pointing it out. – Szabolcs May 12 '13 at 23:44

4 Answers

6

Here is an algorithm that doubles the size of the output list as it fills up, achieving roughly linear computation time, as the benchmark tests below show:

test <- function(bigBigBIGNumber = 1000) {

  n <- 10L                       # current capacity of the results list
  results <- vector("list", n)
  m <- 0L                        # number of results stored so far
  i <- 1L
  while (i < bigBigBIGNumber)  {
    if (runif(1) > 0.5) {        # stand-in for someCondition(i)
      m <- m + 1L
      results[[m]] <- i
      if (m == n) {              # capacity reached: double the list size
        results <- c(results, vector("list", n))
        n <- n * 2L
      }
    }
    i <- i + 1L
  }
  unlist(results)
}

system.time(test(1000))
#    user  system elapsed 
#   0.008   0.000   0.008 
system.time(test(10000))
#    user  system elapsed 
#   0.090   0.002   0.093 
system.time(test(100000))
#    user  system elapsed 
#   0.885   0.051   0.936 
system.time(test(1000000))
#    user  system elapsed 
#   9.428   0.339   9.776 
flodel
  • Thanks, this is very practical, so I'll accept it, but the other answers/comments helped too in understanding what people consider idiomatic in R. – Szabolcs May 12 '13 at 22:52
  • I guess the linearity is really overhead in the loop (generating random numbers, assigning results, etc); the time for growing is equal to just (e.g., for 2^20 elements) `system.time({ x = integer(1); for (i in 1:19) x <- c(x, integer(2^i)) })` (a fraction of a second). – Martin Morgan May 13 '13 at 00:25
4

Presumably there's a maximum size you're willing to tolerate; pre-allocate and fill up to that level, then trim if necessary. This avoids the risk of not being able to satisfy the request to double in size, even when only a small additional amount of memory might be required; it fails early, and involves only one rather than log(n) re-allocations. Here's a function that takes a maximum size, a generating function, and a token that the generating function returns when there is nothing left to generate. We get up to n results before returning.

filln <-
    function(n, FUN, ..., RESULT_TYPE="numeric", DONE_TOKEN=NA_real_)
{
    results <- vector(RESULT_TYPE, n)
    i <- 0L
    while (i < n) {
        ans <- FUN(..., DONE_TOKEN=DONE_TOKEN)
        if (identical(ans, DONE_TOKEN))
            break
        i <- i + 1L
        results[[i]] <- ans
    }

    if (i == n)
        warning("intolerably large result")
    else length(results) <- i
    results
}

Here's a generator

fun <- function(thresh, DONE_TOKEN) {
    x <- rnorm(1)
    if (x > thresh) DONE_TOKEN else x
}

and in action

> set.seed(123L); length(filln(10000, fun, 3))
[1] 163
> set.seed(123L); length(filln(10000, fun, 4))
[1] 10000
Warning message:
In filln(10000, fun, 4) : intolerably large result
> set.seed(123L); length(filln(100000, fun, 4))
[1] 23101

We can benchmark the overhead, approximately, by comparing to something that knows in advance how much space is required

f1 <- function(n, FUN, ...) {
    i <- 0L
    result <- numeric(n)
    while (i < n) {
        i <- i + 1L
        result[i] <- FUN(...)
    }
    result
}

Here we check the timing and value of a single result

>     set.seed(123L); system.time(res0 <- filln(100000, fun, 4))
   user  system elapsed 
  0.944   0.000   0.948 
>     set.seed(123L); system.time(res1 <- f1(23101, fun, 4))
   user  system elapsed 
  0.688   0.000   0.689 
> identical(res0, res1)
[1] TRUE

which for this example is of course eclipsed by the simple vector solution(s)

set.seed(123L); system.time(res2 <- rnorm(23101))
identical(res0, res2)
Martin Morgan
2

If you can't compute 1:bigBigBIGNumber, count the entries in a first pass, create the vector, then populate it in a second pass.

num <- 0L
i <- 0L
while (i < bigBigBIGNumber) {
    if (someCondition(i)) num <- num + 1L
    i <- i + 1L
}
result <- integer(num)
num <- 0L
i <- 0L    # reset the counter before the second pass
while (i < bigBigBIGNumber) {
    if (someCondition(i)) {
        num <- num + 1L
        result[num] <- i
    }
    i <- i + 1L
}

(This code is not tested.)

If you can compute 1:bigBigBIGNumber, this will also work:

I assume that you want to call a function, and not simply tack on the indices themselves. Something like this may be closer to what you want:

values <- seq(bigBigBIGNumber)
sapply(values[someCondition(values)], my_function)
Matthew Lundberg
  • +1 for pointing out the potential value of vectorization in `values[someCondition(values)]` – Szabolcs May 12 '13 at 22:53
  • The problem being, of course, that `someCondition(i)` is calculated *twice*. If this is a complex calculation, then doing it twice may eat up any performance gains. [I have a question along these lines](https://stackoverflow.com/q/52440778/452096), any thoughts would be appreciated. – Stephan Kolassa Sep 21 '18 at 10:04
1

Closer to the second one you listed:

results <- list()
for (i in ...) {
    ...
    results[[i]] <- ...
}

Note that i does not need to be an integer; it could be a character, etc.

Also, you can use `results[[length(results) + 1L]] <- ...` to append if needed, but if you already have an iterator, that probably won't be necessary.
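
For example, appending without an iterator could look like this (a small sketch; the runif call just stands in for whatever condition you are testing):

results <- list()
set.seed(1)
for (i in 1:10) {
    if (runif(1) > 0.5)
        results[[length(results) + 1L]] <- i   # append at the next free slot
}
unlist(results)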

Ricardo Saporta
  • Does this solve the two problems I asked about, i.e. **1.** does it pre-generate all the values to iterate through (I don't want to keep all of them in memory) and **2.** does appending make it of quadratic complexity, i.e. does appending as `results[[i]] <- ...` cause the whole list to be re-allocated? – Szabolcs May 12 '13 at 22:00
  • Some benchmarking shows that it fails on both point *1.* and point *2.* from my comment. However, it also shows that this way of appending to a list is faster than the linked list approach I tried for up to a pretty large length (more than `100000`), and also that looping in `R` this way is so slow that I'll usually "run out of time" before I'd run out of memory when using `for`. – Szabolcs May 12 '13 at 22:14
  • If efficiency is of extreme concern, you probably want to look outside of base `R`. `rcpp` comes to mind. http://cran.r-project.org/web/packages/Rcpp/index.html – Ricardo Saporta May 12 '13 at 22:25
  • Any assignment will have "quadratic complexity" in R. There is always at least one temporary copy, sometimes more, made before the assignment is completed. The advice given is to have at least 3 times the RAM needed to hold the largest object in your workspace. (Consumed at a rate of roughly 10 bytes per 'double'.) – IRTFM May 12 '13 at 22:26
  • @DWin Do you mean that even `x=1:10; x[[3]] <- 10` re-allocates the *complete* array? Surely it won't do this, as you suggested pre-allocation yourself in your other comment. The quadratic complexity is not of the assignment, but of repeated appending `n` times. (That will take `O(n^2)` time.) – Szabolcs May 12 '13 at 22:32
  • @Szabolcs, why are you indexing a vector with `[[]]` instead of `[]`? – Arun May 12 '13 at 22:39
  • @DWin, I am not understanding whether you're saying Ricardo's solution is efficient or not... Could you please clarify? The list is *not* preallocated. This would be equally inefficient I'd suppose. – Arun May 12 '13 at 22:42
  • @Arun Just assume `[` if you like it better and if it makes no difference. – Szabolcs May 12 '13 at 22:43
  • @Szabolcs as long as you're not accessing a list, `[[` and `[` doesn't make a difference, yes. – Arun May 12 '13 at 22:57
  • I'm saying that the inefficiency cannot be eliminated. Copy on assignment in R is just not avoidable without special coding strategies that are arguably outside the scope of the language proper. If you want efficiency I suggest you look into the data.table package or Rcpp. There are quite a few worked examples of data.table application in the SO archives. – IRTFM May 12 '13 at 23:00