Non-redundant version of expand.grid

Question

The R function expand.grid returns all possible combination between the elements of supplied parameters. e.g.

> expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
4   aa   ab
5   ab   ab
6   cc   ab
7   aa   cc
8   ab   cc
9   cc   cc

Do you know an efficient way to get directly (so without any row comparison after expand.grid) only the 'unique' combinations between the supplied vectors? The output will be

  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

EDIT the combination of each element with itself could be eventually discarded from the answer. I don't actually need it in my program even though (mathematically) aa aa would be one (regular) unique combination between one element of Var1 and another of var2.

The solution needs to produce pairs of elements from both vectors (i.e. one from each of the input vectors - so that it could be applied to more than 2 inputs)

Isn't [this](https://stackoverflow.com/questions/12245213/how-to-generate-all-possible-combinations-of-vectors-without-caring-for-order) just the same question? — 5th, Nov 19 '17 at 13:19
Possible duplicate of [How to generate all possible combinations of vectors without caring for order?](https://stackoverflow.com/questions/12245213/how-to-generate-all-possible-combinations-of-vectors-without-caring-for-order) — 5th, Nov 19 '17 at 13:21
I don't think so. That asks about elements from a single vectors. the accepted answers provides also a way to produce the combinations from elements of multiple inputs (2 or more) — Michele, Nov 21 '17 at 11:02

Simon O'Hanlon · Accepted Answer · 2013-06-18T14:25:47.023

33

How about using outer? But this particular function concatenates them into one character string.

outer( c("aa", "ab", "cc"), c("aa", "ab", "cc") , "paste" )
#     [,1]    [,2]    [,3]   
#[1,] "aa aa" "aa ab" "aa cc"
#[2,] "ab aa" "ab ab" "ab cc"
#[3,] "cc aa" "cc ab" "cc cc"

You can also use combn on the unique elements of the two vectors if you don't want the repeating elements (e.g. aa aa)

vals <- c( c("aa", "ab", "cc"), c("aa", "ab", "cc") )
vals <- unique( vals )
combn( vals , 2 )
#     [,1] [,2] [,3]
#[1,] "aa" "aa" "ab"
#[2,] "ab" "cc" "cc"

edited Jun 18 '13 at 14:25

answered Jun 18 '13 at 14:15

Simon O'Hanlon

58,647
14
142
184

oh no, thanks. I need them separate to pass to other function. the previous was perfect! I actually realised I don't need `aa aa`, `bb bb` and `cc cc`. so if you re-edit back and I'll accept it :-) – Michele Jun 18 '13 at 14:21
4

Don't want to offend, but I don't see how this answer is the most upvoted accepted answer and continuing to get upvotes. Both approaches are incorrect. The first proposal with `outer` is nearly identical to `expand.grid`, it is just in a different format. It is also limited to only 2 inputs. For the 2nd approach, `combn` can only act on a single vector. If you have vectors with different elements the approach suggested here will give many incorrect results. Continuing... – Joseph Wood Jun 19 '21 at 20:15
4

The most extreme example is when we have disjoint vectors E.g. `v1 = 1:5, v2 = 6:10`. The correct solution should be isomorphic to `expand.grid(1:5, 6:10)`, which only has 25 results, however `combn(1:10, 2)` has 45 results. Again, I don't mean to offend, I just want to help to clarify for future users. – Joseph Wood Jun 19 '21 at 20:17
I have added an answer that addresses these concerns: https://stackoverflow.com/a/68050873/4408538 – Joseph Wood Jun 21 '21 at 13:46

Ferdinand.kraft · Answer 2 · 2013-06-18T16:07:16.510

26

In base R, you can use this:

expand.grid.unique <- function(x, y, include.equals=FALSE)
{
    x <- unique(x)

    y <- unique(y)

    g <- function(i)
    {
        z <- setdiff(y, x[seq_len(i-include.equals)])

        if(length(z)) cbind(x[i], z, deparse.level=0)
    }

    do.call(rbind, lapply(seq_along(x), g))
}

Results:

> x <- c("aa", "ab", "cc")
> y <- c("aa", "ab", "cc")

> expand.grid.unique(x, y)
     [,1] [,2]
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

> expand.grid.unique(x, y, include.equals=TRUE)
     [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

edited Jun 18 '13 at 16:07

answered Jun 18 '13 at 14:38

Ferdinand.kraft

12,579
10
47
69

I note that it is missing a way to get the self-relationships included, but not the duplicate ones. For instance ("aa", "aa") is not a duplicate for it only appears only. In some cases, one wants the self-relationships too, but not the duplicates. – CoderGuy123 Nov 10 '16 at 11:24

score 18 · Answer 3 · answered Jun 18 '13 at 14:26

18

If the two vectors are the same, there's the combinations function in the gtools package:

library(gtools)
combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = TRUE)

#      [,1] [,2]
# [1,] "aa" "aa"
# [2,] "aa" "ab"
# [3,] "aa" "cc"
# [4,] "ab" "ab"
# [5,] "ab" "cc"
# [6,] "cc" "cc"

And without "aa" "aa", etc.

combinations(n = 3, r = 2, v = c("aa", "ab", "cc"), repeats.allowed = FALSE)

answered Jun 18 '13 at 14:26

BenBarnes

19,114
6
56
74

1

I note that repeats.allowed also removes the self-pairs such as ("aa", "aa"), which are not actually repeated. There is a missing middle way with this function. – CoderGuy123 Nov 10 '16 at 11:22

score 16 · Answer 4 · answered Nov 10 '16 at 12:01

The previous answers were lacking a way to get a specific result, namely to keep the self-pairs but remove the ones with different orders. The gtools package has two functions for these purposes, combinations and permutations. According to this website:

When the order doesn't matter, it is a Combination.
When the order does matter it is a Permutation.

In both cases, we have the decision to make of whether repetitions are allowed or not, and correspondingly, both functions have a repeats.allowed argument, yielding 4 combinations (deliciously meta!). It's worth going over each of these. I simplified the vector to single letters for ease of understanding.

Permutations with repetition

The most expansive option is to allow both self-relations and differently ordered options:

> permutations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
      [,1] [,2]
 [1,] "a"  "a" 
 [2,] "a"  "b" 
 [3,] "a"  "c" 
 [4,] "b"  "a" 
 [5,] "b"  "b" 
 [6,] "b"  "c" 
 [7,] "c"  "a" 
 [8,] "c"  "b" 
 [9,] "c"  "c"

which gives us 9 options. This value can be found from the simple formula n^r i.e. 3^2=9. This is the Cartesian product/join for users familiar with SQL.

There are two ways to limit this: 1) remove self-relations (disallow repetitions), or 2) remove differently ordered options (i.e. combinations).

Combinations with repetitions

If we want to remove differently ordered options, we use:

> combinations(n = 3, r = 2, repeats.allowed = T, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "a" 
[2,] "a"  "b" 
[3,] "a"  "c" 
[4,] "b"  "b" 
[5,] "b"  "c" 
[6,] "c"  "c"

which gives us 6 options. The formula for this value is (r+n-1)!/(r!*(n-1)!) i.e. (2+3-1)!/(2!*(3-1)!)=4!/(2*2!)=24/4=6.

Permutations without repetition

If instead we want to disallow repetitions, we use:

> permutations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "a" 
[4,] "b"  "c" 
[5,] "c"  "a" 
[6,] "c"  "b"

which also gives us 6 options, but different ones! The number of options is the same as above but it's a coincidence. The value can be found from the formula n!/(n-r)! i.e. (3*2*1)/(3-2)!=6/1!=6.

Combinations without repetitions

The most limiting is when we want neither self-relations/repetitions or differently ordered options, in which case we use:

> combinations(n = 3, r = 2, repeats.allowed = F, v = c("a", "b", "c"))
     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c"

which gives us only 3 options. The number of options can be calculated from the rather complex formula n!/(r!(n-r)!) i.e. 3*2*1/(2*1*(3-2)!)=6/(2*1!)=6/2=3.

score 14 · Answer 5 · answered Feb 28 '14 at 15:59

14

Try:

factors <- c("a", "b", "c")

all.combos <- t(combn(factors,2))

     [,1] [,2]
[1,] "a"  "b" 
[2,] "a"  "c" 
[3,] "b"  "c"

This will not include duplicates of each factor (e.g. "a" "a"), but you can add those on easily if needed.

dup.combos <- cbind(factors,factors)

     factors factors
[1,] "a"     "a"    
[2,] "b"     "b"    
[3,] "c"     "c"   

all.combos <- rbind(all.combos,dup.combos)

     factors factors
[1,] "a"     "b"    
[2,] "a"     "c"    
[3,] "b"     "c"    
[4,] "a"     "a"    
[5,] "b"     "b"    
[6,] "c"     "c"

answered Feb 28 '14 at 15:59

EvH

141
1
2

1

This is by far the simplest and most direct method - doesn't require any additional packages and does what was asked for in 1 line. – Ava Nov 07 '22 at 17:35
@Ava - I agree. Was looking for the equivalent of `t(combn(1:nrow(df),2))` and there are a myriad of overcomplicated approaches IMO . This nails it in base R _and_ on one line. – L Tyrone May 26 '23 at 03:02

score 5 · Answer 6 · answered Feb 01 '20 at 16:08

You can use a "greater than" operation to filter redundant combinations. This works with both numeric and character vectors.

> grid <- expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), stringsAsFactors = F)
> grid[grid$Var1 >= grid$Var2, ]
  Var1 Var2
1   aa   aa
2   ab   aa
3   cc   aa
5   ab   ab
6   cc   ab
9   cc   cc

This shouldn't slow down your code too much. If you're expanding vectors containing larger elements (e.g. two lists of dataframes), I recommend using numeric indices that refer to the original vectors.

Joseph Wood · Answer 7 · 2021-06-21T13:45:11.300

TL;DR

Use comboGrid from RcppAlgos:

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

The Details

I recently came across this question R - Expand Grid Without Duplicates and as I was searching for duplicates, I found this question. The question there isn't exactly a duplicate, as it is a bit more general and has additional restrictions which @Ferdinand.kraft shined some light on.

It should be noted that many of the solutions here make use of some sort of combination function. The expand.grid function returns the Cartesian product which is fundamentally different.

The Cartesian product operates on multiple objects which may or may not be the same. Generally speaking, combination functions are applied to a single vector. The same can be said about permutation functions.

Using combination/permutation functions will only produce comparable results to expand.grid if the vectors supplied are identical. As a very simple example, consider v1 = 1:3, v2 = 2:4.

With expand.grid, we see that rows 3 and 5 are duplicates:

expand.grid(1:3, 2:4)
  Var1 Var2
1    1    2
2    2    2
3    3    2
4    1    3
5    2    3
6    3    3
7    1    4
8    2    4
9    3    4

Using combn doesn't quite get us to the solution:

t(combn(unique(c(1:3, 2:4)), 2))
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    3
[5,]    2    4
[6,]    3    4

And with repeats using gtools, we generate too many:

gtools::combinations(4, 2, v = unique(c(1:3, 2:4)), repeats.allowed = TRUE)
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    2    2
 [6,]    2    3
 [7,]    2    4
 [8,]    3    3
 [9,]    3    4
[10,]    4    4

In fact we generate results that are not even in the cartesian product (i.e. expand.grid solution).

We need a solution that creates the following:

     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

I authored the package RcppAlgos and in the latest release v2.4.3, there is a function comboGrid which addresses this very problem. It is very general, flexible, and is fast.

First, to answer the specific question raised by the OP:

library(RcppAlgos)
comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"))
     Var1 Var2
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

And as, @Ferdinand.kraft points out, sometimes the output may need to have duplicates excluded in a given row. For that, we use repetition = FALSE:

comboGrid(c("aa", "ab", "cc"), c("aa", "ab", "cc"), repetition = FALSE)
     Var1 Var2
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

comboGrid is also very general. It can be applied to multiple vectors:

comboGrid(rep(list(c("aa", "ab", "cc")), 3))
      Var1 Var2 Var3
 [1,] "aa" "aa" "aa"
 [2,] "aa" "aa" "ab"
 [3,] "aa" "aa" "cc"
 [4,] "aa" "ab" "ab"
 [5,] "aa" "ab" "cc"
 [6,] "aa" "cc" "cc"
 [7,] "ab" "ab" "ab"
 [8,] "ab" "ab" "cc"
 [9,] "ab" "cc" "cc"
[10,] "cc" "cc" "cc"

Doesn't need the vectors to be identical:

comboGrid(1:3, 2:4)
     Var1 Var2
[1,]    1    2
[2,]    1    3
[3,]    1    4
[4,]    2    2
[5,]    2    3
[6,]    2    4
[7,]    3    3
[8,]    3    4

And can be applied to vectors of various types:

set.seed(123)
my_range <- 3:15
mixed_types <- list(
    int1 = sample(15, sample(my_range, 1)),
    int2 = sample(15, sample(my_range, 1)),
    char1 = sample(LETTERS, sample(my_range, 1)),
    char2 = sample(LETTERS, sample(my_range, 1))
)

dim(expand.grid(mixed_types))
[1] 1950    4

dim(comboGrid(mixed_types, repetition = FALSE))
[1] 1595    4

dim(comboGrid(mixed_types, repetition = TRUE))
[1] 1770    4

The algorithm employed avoids generating the entirety of the Cartesian product and subsequently removing dupes. Ultimately, we create a hash table using the Fundamental theorem of arithmetic along with deduplication as pointed out by user2357112 supports Monica in the answer to Picking unordered combinations from pools with overlap. All of this together with the fact that it is written in C++ means that it is fast and memory efficient:

pools = list(c(1, 10, 14, 6),
             c(7, 2, 4, 8, 3, 11, 12),
             c(11, 3, 13, 4, 15, 8, 6, 5),
             c(10, 1, 3, 2, 9, 5,  7),
             c(1, 5, 10, 3, 8, 14),
             c(15, 3, 7, 10, 4, 5, 8, 6),
             c(14, 9, 11, 15),
             c(7, 6, 13, 14, 10, 11, 9, 4),
             c(6,  3,  2, 14,  7, 12,  9),
             c(6, 11,  2,  5, 15,  7))
             
system.time(combCarts <- comboGrid(pools))
   user  system elapsed 
  0.929   0.062   0.992

nrow(combCarts)
[1] 1205740

## Small object created
print(object.size(combCarts), unit = "Mb")
92 Mb
  
system.time(cartProd <- expand.grid(pools))
   user  system elapsed 
  8.477   2.895  11.461 
  
prod(lengths(pools))
[1] 101154816

## Very large object created
print(object.size(cartProd), unit = "Mb")
7717.5 Mb

score 0 · Answer 8 · answered Apr 06 '20 at 21:10

here's a very ugly version that worked for me on a similar problem.

AHP_code = letters[1:10] 
 temp. <- expand.grid(AHP_code, AHP_code, stringsAsFactors = FALSE)
  temp. <- temp.[temp.$Var1 != temp.$Var2, ] # remove AA, BB, CC, etc. 
  temp.$combo <- NA 
  for(i in 1:nrow(temp.)){  # vectorizing this gave me weird results, loop worked fine. 
    temp.$combo[i] <- paste0(sort(as.character(temp.[i, 1:2])), collapse = "")
  }
  temp. <- temp.[!duplicated(temp.$combo),]
  temp.

score 0 · Answer 9 · answered Mar 03 '22 at 12:11

USING SORT

Just for fun, one can in principle also remove duplicates from expand.grid by combining sort and unique.

unique(t(apply(expand.grid(c("aa", "ab", "cc"), c("aa", "ab", "cc")), 1, sort)))

This gives:

    [,1] [,2]
[1,] "aa" "aa"
[2,] "aa" "ab"
[3,] "aa" "cc"
[4,] "ab" "ab"
[5,] "ab" "cc"
[6,] "cc" "cc"

score 0 · Answer 10 · answered Jul 29 '22 at 07:43

With repetitions (this won't work if you specify different vectors for different columns and for example values in the first column are always bigger than values in the second column):

> v=c("aa","ab","cc")
> e=expand.grid(v,v,stringsAsFactors=F)
> e[!apply(e,1,is.unsorted),]
  Var1 Var2
1   aa   aa
4   aa   ab
5   ab   ab
7   aa   cc
8   ab   cc
9   cc   cc

Without repetitions (this requires using the same vector for each column):

> t(combn(c("aa","ab","cc"),2))
     [,1] [,2]
[1,] "aa" "ab"
[2,] "aa" "cc"
[3,] "ab" "cc"

With repetitions and with different vectors for different columns:

> e=expand.grid(letters[25:26],letters[1:3],letters[2:3],stringsAsFactors=F)
> e[!duplicated(t(apply(e,1,sort))),]
   Var1 Var2 Var3
1     y    a    b
2     z    a    b
3     y    b    b
4     z    b    b
5     y    c    b
6     z    c    b
7     y    a    c
8     z    a    c
11    y    c    c
12    z    c    c

Without repetitions and with different vectors for different columns:

> e=expand.grid(letters[25:26],letters[1:3],letters[2:3],stringsAsFactors=F)
> e=e[!duplicated(t(apply(e,1,sort))),]
> e[!apply(apply(e,1,duplicated),2,any),]
  Var1 Var2 Var3
1    y    a    b
2    z    a    b
5    y    c    b
6    z    c    b
7    y    a    c
8    z    a    c