1

Following example is based on discussion about using expand.grid with large data. As you can see it ends up with error. I guess this is due to possible combinations which is according to mentioned page 68.7 billions:

> v1 <-  c(1:8)
> v2 <-  c(1:8)
> v3 <-  c(1:8)
> v4 <-  c(1:8)
> v5 <-  c(1:8)
> v6 <-  c(1:8)
> v7 <-  c(1:8)
> v8 <-  c(1:8)
> v9 <-  c(1:8)
> v10 <- c(1:8)
> v11 <- c(1:8)
> v12 <- c(1:8)
> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12)
Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
  invalid 'times' value
In addition: Warning message:
In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  NAs introduced by coercion to integer range

Even with eight vectors it kills my CPU and/or RAM (> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8)). Here I've found some improvements which suggests using outer or rep.int. Those solutions works with two vectors so I've not able to apply it for 12 vectors but I guess the principle is the same: It creates large matrix which resides in memory. I'm wondering if there is something like python's xrange which evaluates lazily? Here I've found delayedAssign function but I guess this will not help because there is also mentioned following:

Unfortunately, R evaluates lazy variables when they are pointed to by a data structure, even if their value is not needed at the time. This means that infinite data structures, one common application of laziness in Haskell, are not possible in R.

Is using nested loops only solution for this problem?

PS: I have not specific problem, but suppose you need to do some computation using function which is accepting 12 integer arguments, for some reason. Also suppose that you need to make all combinations of those 12 integers and save results to file. Using 12 nested loops and saving results to file continuously will work (despite it will be slow but it will not kill your RAM). Here is shown how you can use expand.grid and apply function to replace two nested loops. Problem is that creating such matrix with 12 vectors of length 8 using expand.grid has some disadvantages:

  1. generating such matrix is slow
  2. such large matrix consumes a lot of memory (68.7 billion rows and 8 columns)
  3. further iteration over this matrix using apply is slow also

So in my point of view functional approach is much more slower than procedural solution. I'm just wondering if is possible to lazily create large data structure which in theory does not fit in to memory and iterate over it. That's all.

Community
  • 1
  • 1
Wakan Tanka
  • 7,542
  • 16
  • 69
  • 122
  • 1
    What problem? I see a bunch of links, references to "it", and something called `xrange` from Python. Can you state in a sentence or two what you're trying to accomplish? – Gregor Thomas Mar 21 '16 at 23:19
  • 1
    What are you trying to accomplish that you need to create (even the ghost of) such a massive object – MichaelChirico Mar 21 '16 at 23:21
  • 1
    there's an [iterators](https://cran.r-project.org/web/packages/iterators/index.html) package for R that might help ... – Ben Bolker Mar 21 '16 at 23:40
  • 1
    Can you please explain why I've been down voted? Please clarify what is unclear or what further information should I provide. Thanks – Wakan Tanka Mar 22 '16 at 07:56

2 Answers2

3

One (arguably more "proper") way to approach this would be to write your own iterator for iterators that @BenBolker suggested (pdf on writing extensions is here). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually-advancing. (Note: this will suffice given that the computation on each iteration is "more expensive" than this function itself. This could really be improved, but "it works".)

This function returns a named list (with the provided factors) each time the returned function is returned. It is lazy in that it does not expand the entire list of possibles; it is not lazy with the argument themselves, they should be 'consumed' immediately.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  indices <- c(0, rep(1, length(dots)-1))
  function() {
    indices[1] <<- indices[1] + 1
    DONE <- FALSE
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n=1)) return(FALSE)
      indices[rolls] <<- 1
      indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

Sample usage:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21

## <yawn>

nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE

NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.

EDIT

Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    i <<- if (missing(index)) (i + 1L) else index
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L  ),
             argnames)
  }
}

It works with no arguments (auto-increment the internal counter), one argument (seek and set the internal counter), or a vector argument (seek to each and set the counter to the last, returns a data.frame).

This last use-case allows for sampling a subset of the design space:

set.seed(42)
nxt <- lazyExpandGrid2(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94

Thanks alexis_laz for the improvements of cumprod, Map, and index calculations!

Community
  • 1
  • 1
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you this is very useful post. Is also possible to use `nxt()` as a `X` argument for `*apply` function family? I guess this is not possible and the only way is to use it in loop but I might be wrong. Thank you – Wakan Tanka Mar 22 '16 at 09:47
  • If you are curious (and adventurous), I added some more functionality and made it available as a [gist](https://gist.github.com/r2evans/e5531cbab8cf421d14ed). Comments welcome. – r2evans Mar 22 '16 at 21:38
  • BTW: no, I don't think it is possible to use this in an `*apply` function. I believe they need to know the (finite) length of the list/vector ahead of time and pre-allocate the memory for the return value. (I'm not a core developer for R, though, so I might be a little off on this comment.) – r2evans Mar 22 '16 at 21:43
3

Another approach that, somehow, looks valid..:

exp_gr = function(..., index)
{
    args = list(...)
    ns = lengths(args)
    offs = cumprod(c(1L, ns))
    n = offs[length(offs)]

    stopifnot(index <= n)

    i = (index[[1L]] - 1L) %% offs[-1L] %/% offs[-length(offs)] 

    return(do.call(data.frame, 
           setNames(Map("[[", args, i + 1L), 
                    paste("Var", seq_along(args), sep = ""))))
}

In the above function, ... are the arguments to expand.grid and index is the increasing number of combinations. E.g.:

expand.grid(1:3, 10:12, 21:24, letters[2:5])[c(5, 22, 24, 35, 51, 120, 144), ]
#    Var1 Var2 Var3 Var4
#5      2   11   21    b
#22     1   11   23    b
#24     3   11   23    b
#35     2   12   24    b
#51     3   11   22    c
#120    3   10   22    e
#144    3   12   24    e
do.call(rbind, lapply(c(5, 22, 24, 35, 51, 120, 144), 
                      function(i) exp_gr(1:3, 10:12, 21:24, letters[2:5], index = i)))
#  Var1 Var2 Var3 Var4
#1    2   11   21    b
#2    1   11   23    b
#3    3   11   23    b
#4    2   12   24    b
#5    3   11   22    c
#6    3   10   22    e
#7    3   12   24    e

And on large structures:

expand.grid(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2)
#Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
#  invalid 'times' value
#In addition: Warning message:
#In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
#  NAs introduced by coercion to integer range
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1    1    1    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e3 + 487)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   87   15    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e2 ^ 6)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1  100  100  100  100  100  100
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e11 + 154)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   54    2    1    1    1   11

A similar approach to this would be to construct a "class" that stores the ... arguments to use expand.grid on and define a [ method to calculate the appropriate combination index when needed. Using %% and %/% seems valid, though, I guess iterating with these operators will be slower than it needs to be.

alexis_laz
  • 12,884
  • 4
  • 27
  • 37
  • Nice trick with `cumprod`, and having a seekable version can be very useful. (I get a `cannot allocate vector` error from your code, not `NAs introduced`. Do you know why you're getting that error?) Don't know why I haven't been using `lengths` all these years, waaaaay faster than `sapply(dots, length)`. – r2evans Mar 22 '16 at 15:25
  • One concern I see is that it rolls over. Try `exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e12+1)`. A simple bounds-check should cover this case. – r2evans Mar 22 '16 at 15:36
  • @r2evans : Thanks, you're right about the bounds. Also, unless there is a very large number of large length `...` arguments, `cumprod` seems viable (otherwise there'll be some `Inf`s). As for the error you get, it seems that I got stopped earlier on by `rep` trying to coerce to "integer" a value that was large and the resulting `NA` caused the "invalid times", whereas you got stopped later on (not by an invalid range integer but) by the large size of the to-be created object; I guess you'll get a "NAs introduced" by adding more/more-lengthy elements in `expand.grid`? – alexis_laz Mar 22 '16 at 15:48
  • (cont.).. e.g. see `rep.int(1, .Machine$integer.max)` vs `rep.int(1, .Machine$integer.max + 1)` – alexis_laz Mar 22 '16 at 15:50