I have the following data frame:

x <- data.frame("Col1" = c('A', 'B', 'C', 'D'), "Col2" = c('W', 'X', 'Y', 'Z'))

I want a new data frame containing every possible combination of the values in Col1 and Col2, which would give a two-column data frame like:

A W
A X
A Y
A Z
B W
B X
B Y
B Z
C W
...

The data frame will always have two columns, but the number of rows can vary.

I looked at permute() and sample(), but I did not manage to get what I am looking for. Thanks!

ML_Enthousiast
  • try this: `expand.grid(x)`, see this [`expand.grid`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/expand.grid.html) – bouncyball Jun 20 '18 at 15:41
  • Possible duplicate of [Generate list of all possible combinations of elements of vector](https://stackoverflow.com/questions/18705153/generate-list-of-all-possible-combinations-of-elements-of-vector) – bouncyball Jun 20 '18 at 15:45
  • https://stackoverflow.com/questions/18705153 seems different enough to me to be a separate question. This scenario already has the values loaded in a data.frame, which are passed to `expand.grid()` as a single argument. In that linked scenario, the values are passed as multiple arguments to `expand.grid()`. – wibeasley Jun 20 '18 at 16:02
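
The distinction drawn in the last comment can be sketched roughly as follows. This is only an illustration built on the question's data frame `x`; the names `from_df` and `from_vectors` are placeholders introduced here, not part of the original post.

```r
x <- data.frame(Col1 = c("A", "B", "C", "D"),
                Col2 = c("W", "X", "Y", "Z"))

# One argument: the data frame itself; its column names are kept.
from_df <- expand.grid(x)

# Several arguments: one named vector per column.
from_vectors <- expand.grid(Col1 = x$Col1, Col2 = x$Col2)

# Both calls yield the same 16 combinations (4 values x 4 values).
nrow(from_df)
nrow(from_vectors)
```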

2 Answers


tidyr::complete() is designed for this. I'm surprised I don't see a vanilla example on SO.

library(magrittr)
x %>% 
  tidyr::complete(Col1, Col2)

Result:

# A tibble: 16 x 2
   Col1  Col2 
   <fct> <fct>
 1 A     W    
 2 A     X    
 3 A     Y    
 4 A     Z    
 5 B     W    
 6 B     X    
 7 B     Y    
 8 B     Z    
 9 C     W    
10 C     X    
11 C     Y    
12 C     Z    
13 D     W    
14 D     X    
15 D     Y    
16 D     Z    

If your real-world scenario is as simple as the OP's, @bouncyball's suggestion of expand.grid(x) is the cleanest. If your real-world scenario has more complexity, then tidyr::complete() might allow you to grow more easily. I commonly have additional columns beyond the two ID variables being expanded/completed. These are typically the analysis's dependent/outcome variables, and the fill parameter lets you specify their default values for combinations that don't appear in the observed dataset. Here's an SO example.
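
As a rough sketch of how the fill parameter behaves, consider the small dataset below. The data frame `obs` and its columns `site`, `year`, and `count` are hypothetical, invented here for illustration; they do not come from the question.

```r
library(tidyr)

# Hypothetical observed data: only three of the four possible
# site/year combinations appear.
obs <- data.frame(
  site  = c("A", "A", "B"),
  year  = c(2017, 2018, 2017),
  count = c(5, 3, 7)
)

# Cross site and year; the unobserved combination (B, 2018) gets a
# count of 0 instead of NA.
completed <- complete(obs, site, year, fill = list(count = 0))
completed
```

Without the fill argument, the added row would carry NA in `count`, so this is mostly a convenience for outcome columns that have a natural default.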

edited to reflect advice of @bouncyball and @ADuv.

wibeasley
  • I see your point, and I like both. You could remove the pipe and make it one line too. I like `complete()` a tad better because (a) the order matches the OP's target output (where the left column changes slower than the right) and (b) it's probably more natural if the dataset contains additional variables that you don't want crossed (like [this example](https://stackoverflow.com/a/39133601/1082435) using the `fill` parameter). – wibeasley Jun 20 '18 at 15:54
  • If this is the only goal, and you don't use `tidyverse` elsewhere, I recommend `expand.grid()` since tibbles are a different class from data.frames and can introduce weird `class` problems. But yes, it depends on the OP's goal. At least he has two different solutions to try now. – A Duv Jun 20 '18 at 15:54
  • You both are right. I'll edit my response and walk back the 'cleanest' claim. I wrote that before @bouncyball's OP comment, and didn't remember `expand.grid()` accepted entire data.frames. – wibeasley Jun 20 '18 at 16:04
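
The two concerns raised in these comments, row order and result class, can both be handled in a few lines. The following is a minimal sketch using the question's data frame; the names `grid` and `crossed` are placeholders introduced here.

```r
library(tidyr)

x <- data.frame(Col1 = c("A", "B", "C", "D"),
                Col2 = c("W", "X", "Y", "Z"))

# (a) Ordering: expand.grid() varies the first column fastest, so sort the
# result if Col1 should change slowest, as in the question's target output.
grid <- expand.grid(x)
grid <- grid[order(grid$Col1, grid$Col2), ]
rownames(grid) <- NULL

# (b) Class: complete() may return a tibble; as.data.frame() converts the
# result to a plain data.frame either way.
crossed <- as.data.frame(complete(x, Col1, Col2))
class(crossed)
```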

Regarding tidyr::complete vs base::expand.grid, performance might also be a factor.

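According to the benchmark below, complete() is much slower, though the relative difference shrinks as the input grows: by median time it is roughly 85x slower at 10 rows, 43x at 100 rows, and 4x at 1000 rows.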

library(microbenchmark)
library(tidyr)

df <- data.frame(a = 1:10, b = 1:10)
microbenchmark(complete(df, a, b), expand.grid(df))
# Unit: microseconds
#               expr       min       lq       mean    median        uq       max neval
# complete(df, a, b) 15345.348 16065.27 17947.2132 16609.512 17351.317 46415.772   100
#    expand.grid(df)   129.194   144.74   174.8799   194.395   201.337   256.577   100

df <- data.frame(a = 1:100, b = 1:100)
microbenchmark(complete(df, a, b), expand.grid(df))
# Unit: microseconds
#               expr       min         lq       mean     median        uq      max neval
# complete(df, a, b) 15992.523 16380.1030 17743.4860 16611.4730 16998.149 26622.31   100
#    expand.grid(df)   323.588   340.4925   376.6481   383.6575   397.844   665.89   100

df <- data.frame(a = 1:1000, b = 1:1000)
microbenchmark(complete(df, a, b), expand.grid(df))
# Unit: milliseconds
#               expr      min       lq     mean   median       uq       max neval
# complete(df, a, b) 86.58981 88.49813 98.73944 93.62617 98.83436 157.40141   100
#    expand.grid(df) 18.99899 19.40211 21.83331 21.20161 23.71123  33.19729   100
moodymudskipper