data.table versus tidyr::expand_grid

Question

I have:

XIa <- diag(1, 3)
colnames(XIa) <- rownames(XIa) <- c("a0", "a1", "a2")
XIb <- diag(1, 2)
colnames(XIb) <- rownames(XIb) <- c("b0", "b1")
XIc <- diag(1, 2)
colnames(XIc) <- rownames(XIc) <- c("c0", "c1")

tidyr::expand_grid gives me:

tidyr::expand_grid(as.data.frame(XIa), as.data.frame(XIb), as.data.frame(XIc))
# A tibble: 12 x 7
      a0    a1    a2    b0    b1    c0    c1
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     0     0     1     0     1     0
 2     1     0     0     1     0     0     1
 3     1     0     0     0     1     1     0
 4     1     0     0     0     1     0     1
 5     0     1     0     1     0     1     0
 6     0     1     0     1     0     0     1
 7     0     1     0     0     1     1     0
 8     0     1     0     0     1     0     1
 9     0     0     1     1     0     1     0
10     0     0     1     1     0     0     1
11     0     0     1     0     1     1     0
12     0     0     1     0     1     0     1

How do I achieve the same result using data.table?

Clearly, there is this way:

dXIa <- data.table(XIa)
dXIb <- data.table(XIb)
dXIc <- data.table(XIc)

cbind(
  dXIa[c(rep(1:3, each = 4))],
  dXIb[c(rep(1:2, each = 2))],
  dXIc[c(rep(1:2, len = 12))]
)

    a0 a1 a2 b0 b1 c0 c1
 1:  1  0  0  1  0  1  0
 2:  1  0  0  1  0  0  1
 3:  1  0  0  0  1  1  0
 4:  1  0  0  0  1  0  1
 5:  0  1  0  1  0  1  0
 6:  0  1  0  1  0  0  1
 7:  0  1  0  0  1  1  0
 8:  0  1  0  0  1  0  1
 9:  0  0  1  1  0  1  0
10:  0  0  1  1  0  0  1
11:  0  0  1  0  1  1  0
12:  0  0  1  0  1  0  1

but that is probably not optimal/ideal.

Do any of the answers resolve your questions? If not, please comment or update your question with additional information. Thanks! — r2evans, Aug 28 '21 at 20:50

score 2 · Answer 1 · answered Aug 21 '21 at 04:34

You can use CJ, but it does not accept data.tables directly. Using the function cjdt from this answer, which cross-joins the row indices and then subsets each table, you can do:

library(data.table)

dXIa <- data.table(XIa)
dXIb <- data.table(XIb)
dXIc <- data.table(XIc)

cjdt <- function(a,b){
  cj = CJ(1:nrow(a),1:nrow(b))
  cbind(a[cj[[1]],],b[cj[[2]],])
}

Reduce(cjdt, list(dXIa, dXIb, dXIc))

#    a0 a1 a2 b0 b1 c0 c1
# 1:  1  0  0  1  0  1  0
# 2:  1  0  0  1  0  0  1
# 3:  1  0  0  0  1  1  0
# 4:  1  0  0  0  1  0  1
# 5:  0  1  0  1  0  1  0
# 6:  0  1  0  1  0  0  1
# 7:  0  1  0  0  1  1  0
# 8:  0  1  0  0  1  0  1
# 9:  0  0  1  1  0  1  0
#10:  0  0  1  1  0  0  1
#11:  0  0  1  0  1  1  0
#12:  0  0  1  0  1  0  1
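To see why this works, here is a minimal sketch of the index trick on two toy tables. The tables x and y and the column names are made up for illustration; only CJ and cbind come from the answer above.

```r
library(data.table)

# Toy inputs (hypothetical, not from the question)
x <- data.table(x = c("p", "q", "r"))
y <- data.table(y = c(10, 20))

# CJ() returns every combination of its arguments, sorted so that the
# first argument varies slowest and the last varies fastest
idx <- CJ(a = seq_len(nrow(x)), b = seq_len(nrow(y)))

# Subset each table by its index column and bind the pieces side by side
res <- cbind(x[idx$a], y[idx$b])
res
```

Because CJ sorts its output, the row order matches tidyr::expand_grid: the first table changes slowest.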

score 1 · Answer 2 · answered Aug 21 '21 at 22:14

As an alternative to RonakShah's use of cjdt, here's a modified version that has three more features:

It guards against 0-row frames, which should really be a no-op for the 0-row frame;
It uses a single call to cbind instead of Reduce; while Reduce isn't evil here, there may be benefits with a much longer list of frames/tables; and
While not a stated constraint here, it works with data.frame and data.table alike.

cjdt2 <- function(...) {
  dots <- Filter(nrow, list(...))
  eg <- do.call(expand.grid, lapply(sapply(dots, nrow), seq_len))
  do.call(cbind, Map(function(x, i) x[i,], dots, eg))
}
cjdt2(XIa, XIb, XIc)
#    a0 a1 a2 b0 b1 c0 c1
# a0  1  0  0  1  0  1  0
# a1  0  1  0  1  0  1  0
# a2  0  0  1  1  0  1  0
# a0  1  0  0  0  1  1  0
# a1  0  1  0  0  1  1  0
# a2  0  0  1  0  1  1  0
# a0  1  0  0  1  0  0  1
# a1  0  1  0  1  0  0  1
# a2  0  0  1  1  0  0  1
# a0  1  0  0  0  1  0  1
# a1  0  1  0  0  1  0  1
# a2  0  0  1  0  1  0  1

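You can easily convert the result to a data.table, either externally or by modifying the function: use setDT when the inputs are data.frames, or as.data.table for matrix inputs like the ones shown here, since setDT does not accept a matrix.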
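Note that the output above has the first table varying fastest, which is the expand.grid convention, whereas tidyr::expand_grid has the first table varying slowest. If the row order matters, one possible adjustment is sketched below. The name cjdt2_ordered is hypothetical, and the toy matrices A and B are only there to make the example self-contained.

```r
# Hypothetical variant of cjdt2 that reproduces the expand_grid row order
cjdt2_ordered <- function(...) {
  dots <- Filter(nrow, list(...))
  idx <- lapply(dots, function(x) seq_len(nrow(x)))
  # expand.grid() varies its first argument fastest, so build the grid from
  # the reversed index list and then put the columns back in input order
  eg <- do.call(expand.grid, rev(idx))
  eg <- eg[rev(seq_along(eg))]
  do.call(cbind, Map(function(x, i) x[i, ], dots, eg))
}

# Small base-R inputs so the sketch needs no packages
A <- diag(1, 3); colnames(A) <- c("a0", "a1", "a2")
B <- diag(1, 2); colnames(B) <- c("b0", "b1")
res <- cjdt2_ordered(A, B)
res
```

Only the index grid changes; the subsetting and the single cbind call are the same as in cjdt2.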

Answer 3 · Dean MacGregor · answered Aug 26 '21 at 11:12

Here's another approach that uses data.table's merge:

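library(data.table)

expgridDT <- function(...) {
  DTs <- list(...)
  # Fold the tables together pairwise: merge the running result (slot jj)
  # with the next table (slot jj + 1) on a constant dummy key
  for (jj in seq_len(length(DTs) - 1)) {
    DTs[[jj + 1]] <- merge(DTs[[jj]][, c(kfjekflj = 1, .SD)],
                           DTs[[jj + 1]][, c(kfjekflj = 1, .SD)],
                           by = "kfjekflj", allow.cartesian = TRUE)[, !"kfjekflj", with = FALSE]
  }
  return(DTs[[length(DTs)]][])
}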

Essentially, this creates a dummy column on each data.table with a nonsense name (kfjekflj) to make a collision with a real column name unlikely. That dummy column serves as the join column. The function then merges two tables at a time with allow.cartesian turned on, carrying the merged result forward until every data.table passed to the function has been joined in.
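As a rough illustration of the dummy-key idea on just two tables, consider the sketch below. The tables x and y and the helper column name dummy_key are assumptions made for this example, not part of the answer.

```r
library(data.table)

# Toy inputs (hypothetical)
x <- data.table(x = c("p", "q", "r"))
y <- data.table(y = c(10, 20))

# Give copies of both tables the same constant key, so every row of x
# matches every row of y
xk <- copy(x)[, dummy_key := 1L]
yk <- copy(y)[, dummy_key := 1L]

# allow.cartesian = TRUE lets the join return more rows than
# nrow(x) + nrow(y); the helper column is dropped afterwards
crossed <- merge(xk, yk, by = "dummy_key", allow.cartesian = TRUE)
crossed[, dummy_key := NULL]
crossed[]
```

Without allow.cartesian = TRUE, data.table stops with an error here, because the join produces six rows from tables of three and two rows.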

Here's a benchmark:

XIa <- diag(1, 50)
colnames(XIa) <- rownames(XIa) <- paste0("a",1:ncol(XIa))
XIb <- diag(1, 72)
colnames(XIb) <- rownames(XIb) <- paste0("b",1:ncol(XIb))
XIc <- diag(1, 80)
colnames(XIc) <- rownames(XIc) <- paste0("c",1:ncol(XIc))

XIa <- as.data.table(XIa)
XIb <- as.data.table(XIb)
XIc <- as.data.table(XIc)
library(microbenchmark)
microbenchmark(expgridDT(XIa, XIb, XIc), Reduce(cjdt, list(XIa, XIb, XIc)), cjdt2(XIa, XIb, XIc))

Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval
          expgridDT(XIa, XIb, XIc)   167.5827   191.6542   264.8172   203.8769   231.6937   852.2033   100
 Reduce(cjdt, list(XIa, XIb, XIc))   164.4640   217.2215   252.2262   230.7276   255.6974   689.1763   100
              cjdt2(XIa, XIb, XIc) 65611.1425 67829.0407 77024.1458 77151.0220 84385.0727 95048.6625   100
