All possible subpopulations from more than one variable

Question

I have data that looks like this

v1    = sample(c("a","b"), 1000, replace = T)
v2    = sample(c("c","d"), 1000, replace = T)
X     = cbind(v1, v2)

That is, two variables that can take two values each. The goal is to generate an index or something similar to subset this data into all possible subsets based on these two variables. The nine subsets can be described using the following conditions (in a slight abuse of notation):

#1# (a    ) & (c    )
#2# (a    ) & (    d)
#3# (    b) & (c    )
#4# (    b) & (    d)
#5# (a    ) & (c | d)
#6# (    b) & (c | d)
#7# (a | b) & (c    )
#8# (a | b) & (    d)
#9# (a | b) & (c | d)

That is, subset #1 should fulfill (var1 == "a") & (var2 == "c"), subset #5 should fulfill var1 == "a", while subset #9 corresponds to the full data set, and so on.

This question is probably strongly related to this one and I suspect that what I want can be accomplished using combn(). However, I could not figure out how to expand the answers therein to my problem with more than one variable.

It is definitely possible (and at the same time, extremely inelegant) to solve this specific problem using a hardcoded loop. However, the solution should generalize to more variables and a varying number of values for each variable, so hardcoding quickly becomes infeasible.

EDIT: Found an answer in another thread, flagged this one as duplicate.

I don't understand how can `a & b & c & d` be achieved with only two variables (`var1` and `var2`)? — maydin, Jul 23 '20 at 11:30
That's bad notation on my side, point taken. `a & b & c & d` is supposed to mean "the whole data set", i.e. what @Bas said. I thought that the table I provided may be a bit clearer to read than the full logical statements. — yrx1702, Jul 23 '20 at 11:33
I edited my original post and included logical statements that make more sense than what I provided originally. — yrx1702, Jul 23 '20 at 11:37

score 0 · Answer 1 · answered Jul 23 '20 at 12:09

You didn't really describe the format of the output that you wanted, but here is a data.frame, where the two columns represent the combinations of the two variables. I converted your initial object into a data.frame as I am assuming that is what you are working with and it lends itself more naturally to this operation.

Note that my solution is more or less a nested loop, replacing for with lapply.

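# lapply() must iterate over columns, so X has to be a data.frame;
# on the matrix from cbind() it would visit all 2000 cells instead.
X <- as.data.frame(X, stringsAsFactors = FALSE)

varList <-
lapply(X, function(s) {
  s <- sort(unique(s))
  unlist(lapply(seq_along(s), function(n) combn(s, n, paste, collapse=" ")))
})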

Here, I loop through the columns, get the unique values of each column, and use lapply and combn to get all combinations of the levels of that variable. If you want to keep these more structured, you could play around with using list rather than paste.
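As a rough sketch of the "list rather than paste" idea: passing simplify = FALSE to combn makes it return each combination as its own character vector instead of one pasted string. The three-level column s and the name levelSets are made up for illustration and are not part of the original data.

```r
# Hypothetical column with three levels, used only to show the result's shape.
s <- sort(unique(c("c", "d", "e", "d")))

# simplify = FALSE: combn() returns a list of character vectors,
# one per combination, rather than pasting them together.
levelSets <- unlist(
  lapply(seq_along(s), function(n) combn(s, n, simplify = FALSE)),
  recursive = FALSE  # flatten one level only, keeping each vector intact
)

length(levelSets)  # 7, i.e. all non-empty subsets of three levels
levelSets[[4]]     # "c" "d"
```

Keeping the levels as vectors avoids having to split strings again later, which matters if a level name itself contains a space.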

Next, combine the list elements across variables to get the final result.

expand.grid(varList)

This returns

   v1  v2
1   a   c
2   b   c
3 a b   c
4   a   d
5   b   d
6 a b   d
7   a c d
8   b c d
9 a b c d

where the variables are factors. To get character columns instead, pass stringsAsFactors = FALSE to expand.grid.
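If what you need in the end is the subsets themselves in a list, one possible way to get there from this table is sketched below. It assumes that no level name contains a space (the levels are pasted with " " and split again), and the names combos, subsets and allowed, as well as the set.seed call, are my own additions for illustration.

```r
# Toy data as in the question, held in a data.frame.
set.seed(1)  # only so the sketch is reproducible
X <- data.frame(
  v1 = sample(c("a", "b"), 1000, replace = TRUE),
  v2 = sample(c("c", "d"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# All non-empty level combinations per column, pasted with " ".
varList <- lapply(X, function(s) {
  s <- sort(unique(s))
  unlist(lapply(seq_along(s), function(n) combn(s, n, paste, collapse = " ")))
})

# One row per combination across variables, kept as character columns.
combos <- expand.grid(varList, stringsAsFactors = FALSE)

# For each row, keep the observations whose value in every column
# is among the levels allowed for that column.
subsets <- lapply(seq_len(nrow(combos)), function(i) {
  keep <- rep(TRUE, nrow(X))
  for (v in names(X)) {
    allowed <- strsplit(combos[i, v], " ", fixed = TRUE)[[1]]
    keep <- keep & X[[v]] %in% allowed
  }
  X[keep, , drop = FALSE]
})

# Label each subset by its condition, e.g. "a b & c".
names(subsets) <- do.call(paste, c(combos, sep = " & "))
```

With this, subsets[["a & c"]] holds the rows where v1 is "a" and v2 is "c", and subsets[["a b & c d"]] is the full data set. Nothing is hardcoded per column, so the same loop should carry over to more variables or more levels, although the number of subsets grows quickly.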

Thanks for answering. Unfortunately, the `lapply()` returns a list of length 2000 that just contains the data itself for me. — yrx1702, Jul 23 '20 at 12:50
In addition, my desired output would be the nine subsets of data in a list. — yrx1702, Jul 23 '20 at 12:53
