2

I have a list of integers, say: (1,2,3,4,5)

I want to obtain all the possible lists of size 5, such that:

1. The lists can contain repeat elements, e.g. (1,1,1,2,2)

2. Ordering does not matter, e.g. (1,1,2,2,1) is the same as (1,1,1,2,2)

How do I obtain this whole list? I am actually looking for combinations of size 10 from a set of 10 integers.

KohLP
  • 25
  • 6
  • These are called *multisets*, I think [if you refer to the multisets section of this answer](https://stackoverflow.com/a/47983855/903061) you will find a solution that suits you. – Gregor Thomas Apr 16 '18 at 22:36
  • I'm not sure which multiset corresponds to this. Sorry I am very new to R – KohLP Apr 16 '18 at 22:56

3 Answers3

2

The link provided by Gregor seems to rely entirely on third-party packages to produce multisets, so I wanted to give you a base R solution. Note that the packages mentioned in that link will almost certainly be far more efficient for extremely large datasets.

We can use expand.grid() to first generate all possible permutations with repetition of the elements in (1,2,3,4,5). In this case, different orderings are still considered to be distinct. We now want to remove these "extra" combinations that contain the same elements but have different orders, which we can do by using apply() and duplicated().

If you use the multiset calculator here, you'll find that the code below produces the correct number of combinations. Here's the code:

x <- seq(1:5)

df <- expand.grid(x, x, x, x, x) # generates 5^5 combinations, allowing repetition

index <- !duplicated(t(apply(df, 1, sort))) # find extraneous combinations
df <- df[index, ] # select only unique combinations

# check number of rows. It should be 126; one for each combination
nrows(df)

# Output
# [1] 126

# Quick look at part of the dataframe:

head(df)
  Var1 Var2 Var3 Var4 Var5
1    1    1    1    1    1
2    2    1    1    1    1
3    3    1    1    1    1
4    4    1    1    1    1
5    5    1    1    1    1
7    2    2    1    1    1
Marcus Campbell
  • 2,746
  • 4
  • 22
  • 36
  • Thank you so much for the help. That is what I am trying to get at. I am actually looking for 10^10, and I don't think that would be feasible to perform. I am looking at the multisets page, and it's kind of all going over my head. Would you recommend any of those packages? It seems like iterpc is the one to use, but I don't know what it does. – KohLP Apr 16 '18 at 23:14
  • `iterpc` looks good to me, but I honestly have no experience with those packages. My apologies - I thought your actual problem would be closer in scale to your toy data. Personally I'd recommend Maurits Evers' solution over mine; it's much more memory-efficient. Do note however that it will still require a fair amount of computer memory. – Marcus Campbell Apr 16 '18 at 23:24
2

Using the RcppAlgos solution recommended in this answer, we want to choose sets of 5 elements from your input, with repetition, and order doesn't matter (thus comboGeneral(), we would use permuteGeneral() if order mattered). Being coded in C++ under the hood, this will be a very fast solution, and the profiling in the linked answer also found it to be memory efficient. Generating the sets for 10 multichoose 10 still took less than a second on my laptop.

library(RcppAlgos)
x = 1:5
result = comboGeneral(x, m = 5, repetition = T)
dim(result)
# [1] 126   5
head(result)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    1    1    1
# [2,]    1    1    1    1    2
# [3,]    1    1    1    1    3
# [4,]    1    1    1    1    4
# [5,]    1    1    1    1    5
# [6,]    1    1    1    2    2
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

For a similar approach to @MarcusCampbell's within the tidyverse we can use expand to enumerate all possible combinations, and then only keep distinct combinations that are invariant under a permutation (i.e. where the ordering does not matter):

library(tidyverse);
tibble(V1 = 1:5, V2 = 1:5, V3 = 1:5, V4 = 1:5, V5 = 1:5) %>%
    expand(V1, V2, V3, V4, V5) %>%
    rowwise() %>%
    mutate(cmbn = paste(sort(c(V1, V2, V3, V4, V5)), collapse = ",")) %>%
    distinct(cmbn);
    ## A tibble: 126 x 1
    #   cmbn
    #   <chr>
    # 1 1,1,1,1,1
    # 2 1,1,1,1,2
    # 3 1,1,1,1,3
    # 4 1,1,1,1,4
    # 5 1,1,1,1,5
    # 6 1,1,1,2,2
    # 7 1,1,1,2,3
    # 8 1,1,1,2,4
    # 9 1,1,1,2,5
    #10 1,1,1,3,3
    ## ... with 116 more rows
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • To anyone reading through these answers, this is a better solution than mine. I haven't had time to benchmark it, but it's likely more memory efficient than my method by about an order of magnitude. – Marcus Campbell Apr 16 '18 at 23:27