Cartesian product in R memory issues

Question

I am hoping to get every possible combination of elements taken across vectors. These questions show what I am trying to do in Python (Get the cartesian product of a series of lists?) and R (Cartesian product data frame).

However, following the answer from the latter in R (https://stackoverflow.com/a/4309350/9096420), I run into memory issues (i.e., "cannot allocate vector of size 86792.1 Gb"; see also "R memory management / cannot allocate vector of size n Mb"). I have tried a few of the answers there, but the vector appears to be too big for any of them to help.

This leads me to think that something is wrong with how I am approaching this problem. There are many possible combinations, but this seems solvable.

My data:

    dat<-structure(list(rows = c(62L, 63L, 64L, 65L, 68L, 69L, 70L, NA, 
    NA, NA, NA, NA, NA), rows.1 = c(119L, 120L, 122L, 123L, 124L, 
    125L, NA, NA, NA, NA, NA, NA, NA), rows.2 = c(137L, 138L, 139L, 
    140L, 141L, 142L, 143L, 144L, 145L, NA, NA, NA, NA), rows.3 = c(161L, 
    162L, 163L, 164L, 165L, 166L, 167L, NA, NA, NA, NA, NA, NA), 
    rows.4 = c(168L, 169L, 170L, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA), rows.5 = c(148L, 149L, 150L, 151L, 152L, 153L, 
    154L, 155L, 156L, NA, NA, NA, NA), rows.6 = c(135L, 136L, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), rows.7 = c(108L, 
    109L, 110L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), rows.8 = c(116L, 
    117L, 118L, 121L, NA, NA, NA, NA, NA, NA, NA, NA, NA), rows.9 = c(178L, 
    180L, 181L, 182L, 183L, NA, NA, NA, NA, NA, NA, NA, NA), 
    rows.10 = c(179L, 184L, 185L, 186L, 187L, 188L, 189L, 190L, 
    191L, 192L, 193L, 194L, 195L), rows.11 = c(50L, 51L, 52L, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
    -13L))

dat
   rows rows.1 rows.2 rows.3 rows.4 rows.5 rows.6 rows.7 rows.8 rows.9 rows.10 rows.11
1    62    119    137    161    168    148    135    108    116    178     179      50
2    63    120    138    162    169    149    136    109    117    180     184      51
3    64    122    139    163    170    150     NA    110    118    181     185      52
4    65    123    140    164     NA    151     NA     NA    121    182     186      NA
5    68    124    141    165     NA    152     NA     NA     NA    183     187      NA
6    69    125    142    166     NA    153     NA     NA     NA     NA     188      NA
7    70     NA    143    167     NA    154     NA     NA     NA     NA     189      NA
8    NA     NA    144     NA     NA    155     NA     NA     NA     NA     190      NA
9    NA     NA    145     NA     NA    156     NA     NA     NA     NA     191      NA
10   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     192      NA
11   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     193      NA
12   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     194      NA
13   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     195      NA

My goal is to combine one value from each column with every possible combination of values from other columns (only allowing one value from each column). A small example of only two columns that works:

expand.grid(dat[,1],dat[,2])
    Var1 Var2
1     62  119
2     63  119
3     64  119
4     65  119
5     68  119
6     69  119
7     70  119
8     NA  119
9     NA  119
10    NA  119
11    NA  119
12    NA  119
13    NA  119
14    62  120
15    63  120
16    64  120
17    65  120
18    68  120
19    69  120
20    70  120
21    NA  120
22    NA  120
23    NA  120
24    NA  120
25    NA  120
26    NA  120
27    62  122
28    63  122
# ... output truncated

When I try to do it for the entire dataset, I get memory issues:

Either

expand.grid(dat)

OR

expand.grid(dat[,1],dat[,2],dat[,3],dat[,4],dat[,5],dat[,6],dat[,7],dat[,8],
            dat[,9],dat[,10],dat[,11],dat[,12])

(which I assume to be the same), both produce an error:

Error: cannot allocate vector of size 86792.1 Gb

Is there a simpler way to do this that gets around memory issues? What am I doing wrong here?
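
For a sense of scale, the size in that error message can be reproduced with a little arithmetic. This is only a rough sketch: it assumes `expand.grid` needs one integer vector with one entry per combination, that R stores each integer in 4 bytes, and that the "Gb" in the message means 2^30 bytes. The variable names are illustrative.

```r
# Rough estimate of what expand.grid(dat) would have to allocate.
n_rows <- 13   # values per column, NA padding included
n_cols <- 12   # columns in dat

# One row of the result per combination: 13^12
n_combos <- n_rows^n_cols

# Assumption: each column of the result is an integer vector, 4 bytes per value
bytes_per_integer <- 4
gib_per_column <- n_combos * bytes_per_integer / 2^30

round(gib_per_column, 1)  # about 86792.1, the figure in the error message
```

Under these assumptions a single column of the result already needs that much memory, and the full grid would need it twelve times over.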

Here is another way to do it with nested for loops (but it is incredibly cumbersome because it requires a for loop for every column of data):

output<-NULL

for(h in 1:13){
  for(i in 1:13){
    for(j in 1:13){
      output <- rbind(output,
                      c(dat[h, 1], dat[i, 2], dat[j, 3]))
    }}}

output
        [,1] [,2] [,3]
   [1,]   62  119  137
   [2,]   62  119  138
   [3,]   62  119  139
   [4,]   62  119  140
   [5,]   62  119  141
   [6,]   62  119  142
   [7,]   62  119  143
   [8,]   62  119  144
   [9,]   62  119  145
  [10,]   62  119   NA
  [11,]   62  119   NA
  [12,]   62  119   NA
  [13,]   62  119   NA
  [14,]   62  120  137
  [15,]   62  120  138
  [16,]   62  120  139
  [17,]   62  120  140
  [18,]   62  120  141
  [19,]   62  120  142
  [20,]   62  120  143
  [21,]   62  120  144
  [22,]   62  120  145
  # ... output truncated

If I wanted to do this for every combination it would look like:

for(h in 1:13){
  for(i in 1:13){
    for(j in 1:13){
      for(k in 1:13){
        for(l in 1:13){
          for(m in 1:13){
            for(n in 1:13){
              for(o in 1:13){
                for(p in 1:13){
                  for(q in 1:13){
                    for(r in 1:13){
                      for(s in 1:13){
                        output <- rbind(output,
                                        c(dat[h, 1], dat[i, 2], dat[j, 3], dat[k, 4],
                                          dat[l, 5], dat[m, 6], dat[n, 7], dat[o, 8],
                                          dat[p, 9], dat[q, 10], dat[r, 11], dat[s, 12]))
}}}}}}}}}}}}
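
One loop per column can, in principle, be collapsed into a single counter that is decoded into one index per column. The sketch below illustrates that idea. `nth_combination` is a hypothetical helper (not a base R function), and `cols` is small stand-in data rather than the full `dat`. It does not make visiting all 13^12 combinations fast; it only avoids holding them in memory at once.

```r
# Hypothetical helper: return the k-th combination of the Cartesian product
# of the vectors in `cols` without building the full grid.
# Ordering matches the nested loops: the last column varies fastest.
nth_combination <- function(cols, k) {
  sizes <- lengths(cols)
  idx <- numeric(length(cols))
  k <- k - 1                          # switch to a zero-based counter
  for (j in rev(seq_along(cols))) {
    idx[j] <- k %% sizes[j] + 1       # position within column j
    k <- k %/% sizes[j]               # carry the rest to the next column
  }
  mapply(function(v, i) v[i], cols, idx)
}

# Stand-in data (an assumption for illustration, not the real dat)
cols <- list(a = c(62L, 63L), b = c(119L, 120L, 122L), c = c(137L, 138L))

n_total <- prod(lengths(cols))        # 2 * 3 * 2 = 12 combinations
for (k in seq_len(n_total)) {
  print(nth_combination(cols, k))
}
```

Because each combination is computed from its index alone, the same helper could be called on a handful of chosen indices instead of looping over every one.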

How many columns and rows do you have in your dataset? Taking all possible combinations blows up the data significantly. You should expect `rows ^ columns * columns` data points, which you can use to estimate memory requirements. — Bulat, Mar 06 '21 at 18:56
@akrun no error so far - it has been running for 30 minutes. — Dylan_Gomes, Mar 06 '21 at 19:03
@Bulat it is 13x12 as shown above (i.e., that *is* the dataset). — Dylan_Gomes, Mar 06 '21 at 19:05
I don't see how you can get around it, because the formula above suggests that you need a lot of memory and the error message just confirms that. — Bulat, Mar 06 '21 at 19:14
If you do `x=apply(dat, 2, unique); expand.grid(x[1]$rows, x[2]$rows.1, x[3]$rows.2, x[4]$rows.3, x[5]$rows.4, x[6]$rows.5, x[7]$rows.6, x[8]$rows.7, x[9]$rows.8, x[10]$rows.9, x[11]$rows.10, x[12]$rows.11)` you still need 3.3 billion rows in the data frame. Dunno how long that would take. As you have done it without choosing unique values, there are 23 trillion combinations (13^12). — Vons, Mar 06 '21 at 19:16
Hmm, well perhaps I really am up against a wall. I wasn't sure if I was missing something, since I am not so familiar with these types of manipulations. I sure appreciate you folks taking the time to look at this for me. — Dylan_Gomes, Mar 06 '21 at 19:26
(Not 13*12 but rather 13^12, since the number of unique combinations of the first two cols needs to then be multiplied by the number of rows in the third column, and then times 13 more for the 4th col and ....) And 13^12 is a large number, isn't it? Each integer value takes 4 bytes (a double takes 8), so 13^12 (2.329809e+13) times 4 bytes, for every one of the 12 columns, far exceeds the memory available on typical computers these days. I think you should try to think more carefully about why you are attempting this construction. — IRTFM, Mar 06 '21 at 20:58
When I mentioned 13x12, I was referring to the matrix size, not the number of combinations. Yes, indeed it is large. I was hoping to create all possible combinations, but seeing as this is infeasible, I will just sample from these possible combinations. — Dylan_Gomes, Mar 06 '21 at 21:41
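
The sampling idea in the last comment can be sketched without ever building the grid: drawing one value per column, independently and uniformly, gives a uniform random draw from the full Cartesian product. The sketch below makes two assumptions that the thread does not confirm: that the `NA`s in `dat` are only padding and should be dropped, and that sampling with replacement (so duplicate rows are possible) is acceptable. `cols` retypes the non-`NA` values of `dat` as a list, and `sample_combinations` is a hypothetical helper name.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Non-NA values of each column of dat, retyped as a list
# (assumption: the NAs are padding, not meaningful values)
cols <- list(
  rows    = c(62:65, 68:70),
  rows.1  = c(119L, 120L, 122:125),
  rows.2  = 137:145,
  rows.3  = 161:167,
  rows.4  = 168:170,
  rows.5  = 148:156,
  rows.6  = 135:136,
  rows.7  = 108:110,
  rows.8  = c(116:118, 121L),
  rows.9  = c(178L, 180:183),
  rows.10 = c(179L, 184:195),
  rows.11 = 50:52
)

# Number of NA-free combinations: still large, but far below 13^12
n_combos <- prod(lengths(cols))

# Hypothetical helper: draw n random combinations, one value per column.
# sample.int() on the index avoids sample()'s special case for length-1 input.
sample_combinations <- function(cols, n) {
  draws <- lapply(cols, function(v) v[sample.int(length(v), n, replace = TRUE)])
  as.data.frame(draws)
}

sampled <- sample_combinations(cols, 1000)
head(sampled)
```

If distinct combinations are required, `unique(sampled)` would drop any repeats, at the cost of returning slightly fewer than the requested number of rows.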
