2

Been working on this for two days, and don't see any progress. Say I have 20 numbers, and I want to (without replacement)

  • get one unique group of 10 numbers
  • get two unique groups of 3 numbers
  • get two unique groups of 2 numbers

There are (20 choose 10) * (10 choose 3) * (7 choose 3) * (4 choose 2) * (2 choose 2) = 4,655,851,200 total groups.

I'd like either a nested list or a data.frame that I can evaluate these groups over. Is that possible to do quickly in R?

Below I have a pretty unwieldy solution for the case when I'm just trying to get one unique group of 5, one unique group of 3, and one unique group of 2, from 10 numbers.

This code actually works, but quickly becomes untenable for larger groups. I'm wondering if there's a fast way to create this grid? It seems like it should be easily possible, but I can't figure out how. Thanks in advance!!


library(data.table)
library(gtools)

n <- 10
group_sizes <- c(5,3,2)

# initialize first group
group_3_values <-
  combinations(
    n = n,
    r = group_sizes[1], # size of the first group!
    repeats.allowed = T,
    v = 1:n
  )
group_3_values <- as.data.table(group_3_values)
group_3_values[, row := 1:nrow(group_3_values)]
output <- as.data.table(group_3_values)

# run for loop for the remaining groups
start_time <- Sys.time()
for (group_size in group_sizes[-1]) { # size of the second, etc. groups
  output_list <- list()
  for (i in 1:nrow(output)) {
    remaining_options <-
      setdiff(1:n, output[row == i, .SD, .SDcols = !c('row')])
    second_group <- combinations(
      n = length(remaining_options),
      r = group_size,
      repeats.allowed = F,
      v = remaining_options
    )
    second_group <- as.data.table(second_group)
    second_group[, row := i]
    out <-
      merge(
        output[row == i,],
        second_group,
        by = "row",
        all.x = T,
        all.y = T,
        allow.cartesian = T
      )
    out$row <- NULL
    output_list[[i]] <- as.data.table(out)
    out <- NULL
  }
  output <- as.data.table(rbindlist(output_list))
  output[, row := 1:nrow(output)]
}
end_time <- Sys.time()
duration <- end_time - start_time
print(duration)
Daycent
  • 455
  • 4
  • 15
  • 2
    I am not understanding the goal. Maybe provide a sample of the expected output. If you are looking to divide 20 numbers into 5 separate groups, why not use the `sample()` function to shuffle the initial groups and then divide into the groups of 10, 3, 3, 2, and 2. – Dave2e Oct 22 '22 at 04:14
  • @Dave2e The problem with that is that you will also get e.g. `[1, 2, ..., 10]`, `[2, 1, ..., 10]` for the first group, which is not unique. – jay.sf Oct 22 '22 at 05:51
  • 1
    There is a function in `RcppAlgos` called `comboGroups` that will partition a vector into groups of equal size. Your problem is similar and is something I've thought about for a while (I am the author). I always found it tricky to nail down whether the order of the groups by size matters. When all groups are the same size, you don't have this problem. What I mean is, can the first group be of size 3, or does it have to be of size 10? This changes things considerably. If order doesn't matter, you will be happy to know that the total is greatly reduced! – Joseph Wood Oct 22 '22 at 17:15
  • yes, order doesn't matter! But ignoring order should give you ~4m groups, no? e.g. (20 choose 3) * (17 choose 10) * (7 choose 3) * (4 choose 2) * (2 choose 2) == (20 choose 10) * (10 choose 3) * (7 choose 3) * (4 choose 2) * (2 choose 2) – Daycent Oct 22 '22 at 17:18
  • @Daycent, I think you meant 4 billion... Anyway, no, you will not get that many results since you have multiple groups of the same width. Consider a simple example where we want to get two groups of 2 from 4. Using your logic, we would get `choose(4, 2) * choose(2, 2) = 6` total results. If you write them out, you will see that the last 3 are duplicates of the first 3 only flipped around. So technically, for your example there would be `4655851200 / (1! * 2! * 2!) = 1163962800` total results. – Joseph Wood Oct 23 '22 at 23:58
  • See this question https://stackoverflow.com/q/57732672/4408538 for more information. There is some really good discussion in the answers and comments. – Joseph Wood Oct 24 '22 at 00:02
  • 1
    This is incredibly helpful Joseph. Thanks so much for your amazing work! – Daycent Oct 25 '22 at 00:23

1 Answers1

4

I think materializing a table or even a vector of 4,655,851,200 elements is not a good choice. Perhaps the best you can do is to use combn to generate each combination, with a callback to run your desired code.

This example generates and prints the elements for every combination and increases a counter n in the global scope. I use callCC for early exit at 10 iterations .

x <- 1:20

n<-0
callCC(function(exit)
  combn(x, 10, function(i)
    combn(setdiff(x, c(i)), 3, function(j)
      combn(setdiff(x, c(i,j)), 3, function(k)
        combn(setdiff(x, c(i,j,k)), 2, function(l)
          combn(setdiff(x, c(i,j,k,l)), 2, function(m){
            n<<-n+1
            # YOUR CODE HERE
            print(paste0(sapply(list(i,j,k,l,m), paste0, collapse=" "), collapse=" | "))
            if(n>=10) exit("early exit")
          }, simplify=F),
        simplify=F),
      simplify=F),
    simplify=F),
  simplify=F))

#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 18 | 19 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 19 | 18 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 20 | 18 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 18 19 | 17 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 18 20 | 17 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 19 20 | 17 18"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 18 | 19 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 19 | 18 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 20 | 18 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 18 19 | 16 20"
#> [1] "early exit"
Ric
  • 5,362
  • 1
  • 10
  • 23