R: extract inner higher level combinations (groups of 1, 2, 3, and 4 elements) out of a data frame of combinations of 5 elements

Question

Sorry I have to post another question following up on this one and this other one.

While the answer to the second one addresses the MWE perfectly, in my real world data I need to do things differently, and wondered if someone could help.

So this time around, my starting point is a data frame (named plusminus_df) of combinations of 5 elements (in reality it can be 1 to n), of the following form:

> markers=LETTERS[1:5]
> plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
> names(plusminus_df)=LETTERS[1:5]
> head(plusminus_df)
  A B C D E
1 + + + + +
2 - + + + +
3 + - + + +
4 - - + + +
5 + + - + +
6 - + - + +

So it is just a data frame of all +/- combinations for the 5 markers (note that the number of markers is variable). What I need to do at this point is extract the inner higher-level combinations of 1, 2, 3, and 4 markers (again, these numbers are variable) while preserving the same data frame structure, which means filling the unused markers with NA.

So my expected result would be something like this:

> final_df
      A    B    C    D    E
1     + <NA> <NA> <NA> <NA>
2     - <NA> <NA> <NA> <NA>
3     +    - <NA> <NA> <NA>
4     -    - <NA> <NA> <NA>
5     +    + <NA> <NA> <NA>
6     -    + <NA> <NA> <NA>
7     +    -    - <NA> <NA>
8     -    -    - <NA> <NA>
9     +    +    + <NA> <NA>
10    -    +    + <NA> <NA>
11    +    -    + <NA> <NA>
12    -    -    + <NA> <NA>
13    +    +    - <NA> <NA>
14    -    +    - <NA> <NA>
15    +    -    -    - <NA>
16    -    -    -    - <NA>
17    +    +    +    + <NA>
...
n     +    +    +    +    +
n+1   -    +    +    +    +
n+2   +    -    +    +    +
n+3   -    -    +    +    +
n+4   +    +    -    +    +
n+5   -    +    -    +    +
...

With all the possible combinations of 1 marker (+ and -), 2 markers, 3, 4, and 5 (as in the original), filling in the non-used markers with NA.

So the answer to the second question works well to build this desired final dataframe from scratch, just from the original markers vector. But in my real world case I can actually retrieve a filtered down list of 5 marker combinations in the form of the plusminus_df above... What would be the most straightforward and efficient way to obtain the desired dataframe from this one, without having to deal with messy nested loops?

ekoam · Answer 1 · 2020-10-27T10:51:08.177

Update

I should have asked this question days ago. What do you mean by "obtaining the desired dataframe from a filtered down list of 5 markers"? My solution differs from the other answers here because, for example, if you have a filtered down list like this,

A  B
-  -
+  +

then it only allows the following combinations in the output

 A   B
 -  NA
NA   -
 -   -
 +  NA
NA   +
 +   +

Note that you will never get "+ -" or "- +" because they are not combinations shown in your "filtered down list".

As far as I can tell, the other answers never consider this issue. Applying expand.grid (or other similar functions) to the unique entries in A and B yields "+ -" and "- +" in the output. My answer is also very inefficient, because I have no idea how to solve this issue efficiently. Please ignore my answer if I have misunderstood your question.

However, perhaps you should clarify this point?
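
To make the distinction concrete, here is a minimal base-R sketch using the two-row example above. The names `filtered`, `row_wise`, and `col_wise` are made up for illustration: `row_wise` follows the per-row idea of this answer, while `col_wise` mimics the column-wise `expand.grid` approach.

```r
# Hypothetical two-row "filtered down" input, as in the example above
filtered <- data.frame(A = c("-", "+"), B = c("-", "+"), stringsAsFactors = FALSE)

# Row-wise: expand each row on its own, so values from different rows never mix
row_wise <- unique(do.call(rbind, lapply(seq_len(nrow(filtered)), function(i) {
  r <- as.list(filtered[i, , drop = FALSE])
  # all value/NA combinations for this row, minus the final all-NA row
  head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
})))
row.names(row_wise) <- NULL

# Column-wise: combine the unique values of each column (plus NA) freely
col_wise <- expand.grid(lapply(filtered, function(v) c(unique(v), NA)),
                        stringsAsFactors = FALSE)

# TRUE only if the mixed combination "+ -" appears in a result
has_plus_minus <- function(d) any(d$A == "+" & d$B == "-", na.rm = TRUE)
has_plus_minus(row_wise)
has_plus_minus(col_wise)
```

Under these assumptions the row-wise result has 6 rows and never mixes values across input rows, while the column-wise result has 9 rows, including "+ -", "- +", and the all-NA row.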

Original Answer

Is this what you want?

# First, expand each row to a dataframe of all possible combinations;
# use `head(..., -1L)` to drop the last combination, which is just a row of NAs.
# Then, select only those unique combinations in the resultant dataframe.

library(dplyr)
out <- unique(bind_rows(apply(
  sample.df, 1L, 
  function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
)))
row.names(out) <- NULL

sample.df looks like this (row numbers correspond to the ones in plusminus_df)

   A B C D E
12 - - + - +
19 + - + + -
17 + + + + -
21 + + - + -
3  + - + + +
5  + + - + +
8  - - - + +
24 - - - + -
31 + - - - -
6  - + - + +

Output looks like this

       A    B    C    D    E
1      -    -    +    -    +
2   <NA>    -    +    -    +
3      - <NA>    +    -    +
4   <NA> <NA>    +    -    +
5      -    - <NA>    -    +
6   <NA>    - <NA>    -    +
7      - <NA> <NA>    -    +
8   <NA> <NA> <NA>    -    +
9      -    -    + <NA>    +
10  <NA>    -    + <NA>    +
11     - <NA>    + <NA>    +
12  <NA> <NA>    + <NA>    +
13     -    - <NA> <NA>    +
14  <NA>    - <NA> <NA>    +
15     - <NA> <NA> <NA>    +
16  <NA> <NA> <NA> <NA>    +
17     -    -    +    - <NA>
18  <NA>    -    +    - <NA>
19     - <NA>    +    - <NA>
20  <NA> <NA>    +    - <NA>
21     -    - <NA>    - <NA>
22  <NA>    - <NA>    - <NA>
23     - <NA> <NA>    - <NA>
24  <NA> <NA> <NA>    - <NA>
25     -    -    + <NA> <NA>
26  <NA>    -    + <NA> <NA>
27     - <NA>    + <NA> <NA>
28  <NA> <NA>    + <NA> <NA>
29     -    - <NA> <NA> <NA>
30  <NA>    - <NA> <NA> <NA>
31     - <NA> <NA> <NA> <NA>
32     +    -    +    +    -
33  <NA>    -    +    +    -
34     + <NA>    +    +    -
35  <NA> <NA>    +    +    -
36     +    - <NA>    +    -
37  <NA>    - <NA>    +    -
38     + <NA> <NA>    +    -
39  <NA> <NA> <NA>    +    -
40     +    -    + <NA>    -
41  <NA>    -    + <NA>    -
42     + <NA>    + <NA>    -
43  <NA> <NA>    + <NA>    -
44     +    - <NA> <NA>    -
45  <NA>    - <NA> <NA>    -
46     + <NA> <NA> <NA>    -
47  <NA> <NA> <NA> <NA>    -
48     +    -    +    + <NA>
49  <NA>    -    +    + <NA>
50     + <NA>    +    + <NA>
51  <NA> <NA>    +    + <NA>
52     +    - <NA>    + <NA>
53  <NA>    - <NA>    + <NA>
54     + <NA> <NA>    + <NA>
55  <NA> <NA> <NA>    + <NA>
56     +    -    + <NA> <NA>
57     + <NA>    + <NA> <NA>
58     +    - <NA> <NA> <NA>
59     + <NA> <NA> <NA> <NA>
60     +    +    +    +    -
61  <NA>    +    +    +    -
62     +    + <NA>    +    -
63  <NA>    + <NA>    +    -
64     +    +    + <NA>    -
65  <NA>    +    + <NA>    -
66     +    + <NA> <NA>    -
67  <NA>    + <NA> <NA>    -
68     +    +    +    + <NA>
69  <NA>    +    +    + <NA>
70     +    + <NA>    + <NA>
71  <NA>    + <NA>    + <NA>
72     +    +    + <NA> <NA>
73  <NA>    +    + <NA> <NA>
74     +    + <NA> <NA> <NA>
75  <NA>    + <NA> <NA> <NA>
76     +    +    -    +    -
77  <NA>    +    -    +    -
78     + <NA>    -    +    -
79  <NA> <NA>    -    +    -
80     +    +    - <NA>    -
81  <NA>    +    - <NA>    -
82     + <NA>    - <NA>    -
83  <NA> <NA>    - <NA>    -
84     +    +    -    + <NA>
85  <NA>    +    -    + <NA>
86     + <NA>    -    + <NA>
87  <NA> <NA>    -    + <NA>
88     +    +    - <NA> <NA>
89  <NA>    +    - <NA> <NA>
90     + <NA>    - <NA> <NA>
91  <NA> <NA>    - <NA> <NA>
92     +    -    +    +    +
93  <NA>    -    +    +    +
94     + <NA>    +    +    +
95  <NA> <NA>    +    +    +
96     +    - <NA>    +    +
97  <NA>    - <NA>    +    +
98     + <NA> <NA>    +    +
99  <NA> <NA> <NA>    +    +
100    +    -    + <NA>    +
101    + <NA>    + <NA>    +
102    +    - <NA> <NA>    +
103    + <NA> <NA> <NA>    +
104    +    +    -    +    +
105 <NA>    +    -    +    +
106    + <NA>    -    +    +
107 <NA> <NA>    -    +    +
108    +    + <NA>    +    +
109 <NA>    + <NA>    +    +
110    +    +    - <NA>    +
111 <NA>    +    - <NA>    +
112    + <NA>    - <NA>    +
113 <NA> <NA>    - <NA>    +
114    +    + <NA> <NA>    +
115 <NA>    + <NA> <NA>    +
116    -    -    -    +    +
117 <NA>    -    -    +    +
118    - <NA>    -    +    +
119    -    - <NA>    +    +
120    - <NA> <NA>    +    +
121    -    -    - <NA>    +
122 <NA>    -    - <NA>    +
123    - <NA>    - <NA>    +
124    -    -    -    + <NA>
125 <NA>    -    -    + <NA>
126    - <NA>    -    + <NA>
127    -    - <NA>    + <NA>
128    - <NA> <NA>    + <NA>
129    -    -    - <NA> <NA>
130 <NA>    -    - <NA> <NA>
131    - <NA>    - <NA> <NA>
132    -    -    -    +    -
133 <NA>    -    -    +    -
134    - <NA>    -    +    -
135    -    - <NA>    +    -
136    - <NA> <NA>    +    -
137    -    -    - <NA>    -
138 <NA>    -    - <NA>    -
139    - <NA>    - <NA>    -
140    -    - <NA> <NA>    -
141    - <NA> <NA> <NA>    -
142    +    -    -    -    -
143 <NA>    -    -    -    -
144    + <NA>    -    -    -
145 <NA> <NA>    -    -    -
146    +    - <NA>    -    -
147 <NA>    - <NA>    -    -
148    + <NA> <NA>    -    -
149 <NA> <NA> <NA>    -    -
150    +    -    - <NA>    -
151    +    -    -    - <NA>
152 <NA>    -    -    - <NA>
153    + <NA>    -    - <NA>
154 <NA> <NA>    -    - <NA>
155    +    - <NA>    - <NA>
156    + <NA> <NA>    - <NA>
157    +    -    - <NA> <NA>
158    -    +    -    +    +
159    -    + <NA>    +    +
160    -    +    - <NA>    +
161    -    + <NA> <NA>    +
162    -    +    -    + <NA>
163    -    + <NA>    + <NA>
164    -    +    - <NA> <NA>
165    -    + <NA> <NA> <NA>

score 3 · Answer 2 · answered Oct 26 '20 at 06:59

Here is a tidyverse solution.

add_row() will add a row of NAs. map(unique) will get the unique values per column. And, expand.grid() will put all combinations into a data frame.

library(tidyverse)

plusminus_df %>%
  add_row() %>%
  map(unique) %>%
  expand.grid()
#>         A    B C D E
#>   1     +    + + + +
#>   2     -    + + + +
#>   3  <NA>    + + + +
#>   4     +    - + + +
#>   5     -    - + + +
#>   6  <NA>    - + + +
#>   7     + <NA> + + +
#>   8     - <NA> + + +
#>   9  <NA> <NA> + + +
#>   10    +    + - + +
#>   11    -    + - + +
#>   12 <NA>    + - + +
#>   13    +    - - + +
#>   14    -    - - + +
#>   15 <NA>    - - + +
#>   16    + <NA> - + +
#>   17    - <NA> - + +
#>   ...
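
The result above still contains the all-NA row and is not grouped by the number of markers used, unlike the layout sketched in the question. A possible post-processing step, written in base R so that it runs without the tidyverse, might look like this. The names `grid`, `n_used`, and `final_df` are illustrative, and three markers are used only to keep the example small.

```r
# Small stand-in for plusminus_df (3 markers instead of 5)
markers <- LETTERS[1:3]
plusminus_df <- expand.grid(lapply(markers, function(x) c("+", "-")),
                            stringsAsFactors = FALSE)
names(plusminus_df) <- markers

# Base-R counterpart of add_row() %>% map(unique) %>% expand.grid()
grid <- expand.grid(lapply(plusminus_df, function(v) c(unique(v), NA)),
                    stringsAsFactors = FALSE)

# Number of markers actually used (non-NA) in each row
n_used <- rowSums(!is.na(grid))

# Drop the all-NA row, then list 1-marker rows first, then 2-marker rows, ...
final_df <- grid[n_used > 0, , drop = FALSE]
final_df <- final_df[order(rowSums(!is.na(final_df))), , drop = FALSE]
row.names(final_df) <- NULL
head(final_df)
```

With three markers this leaves 3^3 - 1 = 26 rows, the first six of which use exactly one marker.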

score 3 · Answer 3 · answered Oct 26 '20 at 21:54

I think you can try permutations from package gtools directly

library(gtools)
x <- c("+", "-", NA)
colNames <- LETTERS[1:5]
# permutations() returns a matrix, so set the column names explicitly
final_df <- setNames(
  as.data.frame(
    permutations(length(x), length(colNames), x, set = FALSE, repeats.allowed = TRUE)
  ),
  colNames
)

which gives you

> final_df
       A    B    C    D    E
1      +    +    +    +    +
2      +    +    +    +    -
3      +    +    +    + <NA>
4      +    +    +    -    +
5      +    +    +    -    -
6      +    +    +    - <NA>
7      +    +    + <NA>    +
8      +    +    + <NA>    -
9      +    +    + <NA> <NA>
10     +    +    -    +    +
11     +    +    -    +    -
12     +    +    -    + <NA>
13     +    +    -    -    +
14     +    +    -    -    -
15     +    +    -    - <NA>
16     +    +    - <NA>    +
17     +    +    - <NA>    -
18     +    +    - <NA> <NA>
19     +    + <NA>    +    +
20     +    + <NA>    +    -
21     +    + <NA>    + <NA>
22     +    + <NA>    -    +
23     +    + <NA>    -    -
24     +    + <NA>    - <NA>
25     +    + <NA> <NA>    +
26     +    + <NA> <NA>    -
27     +    + <NA> <NA> <NA>
28     +    -    +    +    +
29     +    -    +    +    -
30     +    -    +    + <NA>
31     +    -    +    -    +
32     +    -    +    -    -
33     +    -    +    - <NA>
34     +    -    + <NA>    +
35     +    -    + <NA>    -
36     +    -    + <NA> <NA>
37     +    -    -    +    +
38     +    -    -    +    -
39     +    -    -    + <NA>
40     +    -    -    -    +
41     +    -    -    -    -
42     +    -    -    - <NA>
43     +    -    - <NA>    +
44     +    -    - <NA>    -
45     +    -    - <NA> <NA>
46     +    - <NA>    +    +
47     +    - <NA>    +    -
48     +    - <NA>    + <NA>
49     +    - <NA>    -    +
50     +    - <NA>    -    -
51     +    - <NA>    - <NA>
52     +    - <NA> <NA>    +
53     +    - <NA> <NA>    -
54     +    - <NA> <NA> <NA>
55     + <NA>    +    +    +
56     + <NA>    +    +    -
57     + <NA>    +    + <NA>
58     + <NA>    +    -    +
59     + <NA>    +    -    -
60     + <NA>    +    - <NA>
61     + <NA>    + <NA>    +
62     + <NA>    + <NA>    -
63     + <NA>    + <NA> <NA>
64     + <NA>    -    +    +
65     + <NA>    -    +    -
66     + <NA>    -    + <NA>
67     + <NA>    -    -    +
68     + <NA>    -    -    -
69     + <NA>    -    - <NA>
70     + <NA>    - <NA>    +
71     + <NA>    - <NA>    -
72     + <NA>    - <NA> <NA>
73     + <NA> <NA>    +    +
74     + <NA> <NA>    +    -
75     + <NA> <NA>    + <NA>
76     + <NA> <NA>    -    +
77     + <NA> <NA>    -    -
78     + <NA> <NA>    - <NA>
79     + <NA> <NA> <NA>    +
80     + <NA> <NA> <NA>    -
81     + <NA> <NA> <NA> <NA>
82     -    +    +    +    +
83     -    +    +    +    -
84     -    +    +    + <NA>
85     -    +    +    -    +
86     -    +    +    -    -
87     -    +    +    - <NA>
88     -    +    + <NA>    +
89     -    +    + <NA>    -
90     -    +    + <NA> <NA>
91     -    +    -    +    +
92     -    +    -    +    -
93     -    +    -    + <NA>
94     -    +    -    -    +
95     -    +    -    -    -
96     -    +    -    - <NA>
97     -    +    - <NA>    +
98     -    +    - <NA>    -
99     -    +    - <NA> <NA>
100    -    + <NA>    +    +
101    -    + <NA>    +    -
102    -    + <NA>    + <NA>
103    -    + <NA>    -    +
104    -    + <NA>    -    -
105    -    + <NA>    - <NA>
106    -    + <NA> <NA>    +
107    -    + <NA> <NA>    -
108    -    + <NA> <NA> <NA>
109    -    -    +    +    +
110    -    -    +    +    -
111    -    -    +    + <NA>
112    -    -    +    -    +
113    -    -    +    -    -
114    -    -    +    - <NA>
115    -    -    + <NA>    +
116    -    -    + <NA>    -
117    -    -    + <NA> <NA>
118    -    -    -    +    +
119    -    -    -    +    -
120    -    -    -    + <NA>
121    -    -    -    -    +
122    -    -    -    -    -
123    -    -    -    - <NA>
124    -    -    - <NA>    +
125    -    -    - <NA>    -
126    -    -    - <NA> <NA>
127    -    - <NA>    +    +
128    -    - <NA>    +    -
129    -    - <NA>    + <NA>
130    -    - <NA>    -    +
131    -    - <NA>    -    -
132    -    - <NA>    - <NA>
133    -    - <NA> <NA>    +
134    -    - <NA> <NA>    -
135    -    - <NA> <NA> <NA>
136    - <NA>    +    +    +
137    - <NA>    +    +    -
138    - <NA>    +    + <NA>
139    - <NA>    +    -    +
140    - <NA>    +    -    -
141    - <NA>    +    - <NA>
142    - <NA>    + <NA>    +
143    - <NA>    + <NA>    -
144    - <NA>    + <NA> <NA>
145    - <NA>    -    +    +
146    - <NA>    -    +    -
147    - <NA>    -    + <NA>
148    - <NA>    -    -    +
149    - <NA>    -    -    -
150    - <NA>    -    - <NA>
151    - <NA>    - <NA>    +
152    - <NA>    - <NA>    -
153    - <NA>    - <NA> <NA>
154    - <NA> <NA>    +    +
155    - <NA> <NA>    +    -
156    - <NA> <NA>    + <NA>
157    - <NA> <NA>    -    +
158    - <NA> <NA>    -    -
159    - <NA> <NA>    - <NA>
160    - <NA> <NA> <NA>    +
161    - <NA> <NA> <NA>    -
162    - <NA> <NA> <NA> <NA>
163 <NA>    +    +    +    +
164 <NA>    +    +    +    -
165 <NA>    +    +    + <NA>
166 <NA>    +    +    -    +
167 <NA>    +    +    -    -
168 <NA>    +    +    - <NA>
169 <NA>    +    + <NA>    +
170 <NA>    +    + <NA>    -
171 <NA>    +    + <NA> <NA>
172 <NA>    +    -    +    +
173 <NA>    +    -    +    -
174 <NA>    +    -    + <NA>
175 <NA>    +    -    -    +
176 <NA>    +    -    -    -
177 <NA>    +    -    - <NA>
178 <NA>    +    - <NA>    +
179 <NA>    +    - <NA>    -
180 <NA>    +    - <NA> <NA>
181 <NA>    + <NA>    +    +
182 <NA>    + <NA>    +    -
183 <NA>    + <NA>    + <NA>
184 <NA>    + <NA>    -    +
185 <NA>    + <NA>    -    -
186 <NA>    + <NA>    - <NA>
187 <NA>    + <NA> <NA>    +
188 <NA>    + <NA> <NA>    -
189 <NA>    + <NA> <NA> <NA>
190 <NA>    -    +    +    +
191 <NA>    -    +    +    -
192 <NA>    -    +    + <NA>
193 <NA>    -    +    -    +
194 <NA>    -    +    -    -
195 <NA>    -    +    - <NA>
196 <NA>    -    + <NA>    +
197 <NA>    -    + <NA>    -
198 <NA>    -    + <NA> <NA>
199 <NA>    -    -    +    +
200 <NA>    -    -    +    -
201 <NA>    -    -    + <NA>
202 <NA>    -    -    -    +
203 <NA>    -    -    -    -
204 <NA>    -    -    - <NA>
205 <NA>    -    - <NA>    +
206 <NA>    -    - <NA>    -
207 <NA>    -    - <NA> <NA>
208 <NA>    - <NA>    +    +
209 <NA>    - <NA>    +    -
210 <NA>    - <NA>    + <NA>
211 <NA>    - <NA>    -    +
212 <NA>    - <NA>    -    -
213 <NA>    - <NA>    - <NA>
214 <NA>    - <NA> <NA>    +
215 <NA>    - <NA> <NA>    -
216 <NA>    - <NA> <NA> <NA>
217 <NA> <NA>    +    +    +
218 <NA> <NA>    +    +    -
219 <NA> <NA>    +    + <NA>
220 <NA> <NA>    +    -    +
221 <NA> <NA>    +    -    -
222 <NA> <NA>    +    - <NA>
223 <NA> <NA>    + <NA>    +
224 <NA> <NA>    + <NA>    -
225 <NA> <NA>    + <NA> <NA>
226 <NA> <NA>    -    +    +
227 <NA> <NA>    -    +    -
228 <NA> <NA>    -    + <NA>
229 <NA> <NA>    -    -    +
230 <NA> <NA>    -    -    -
231 <NA> <NA>    -    - <NA>
232 <NA> <NA>    - <NA>    +
233 <NA> <NA>    - <NA>    -
234 <NA> <NA>    - <NA> <NA>
235 <NA> <NA> <NA>    +    +
236 <NA> <NA> <NA>    +    -
237 <NA> <NA> <NA>    + <NA>
238 <NA> <NA> <NA>    -    +
239 <NA> <NA> <NA>    -    -
240 <NA> <NA> <NA>    - <NA>
241 <NA> <NA> <NA> <NA>    +
242 <NA> <NA> <NA> <NA>    -
243 <NA> <NA> <NA> <NA> <NA>
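
As a sanity check on the 243 rows above: with 5 markers, a row that uses exactly k of them can be chosen in choose(5, k) * 2^k ways, and summing over k = 0, ..., 5 gives 3^5. The sketch below verifies this count against a plain `expand.grid` call; the variable names are illustrative and gtools is not needed for it.

```r
n_markers <- 5
k <- 0:n_markers

# Rows that use exactly k markers: pick the k columns, then "+" or "-" for each
rows_per_k <- choose(n_markers, k) * 2^k
names(rows_per_k) <- k
rows_per_k
sum(rows_per_k)  # 243, i.e. 3^5

# Compare with the full grid over c("+", "-", NA)
grid <- expand.grid(rep(list(c("+", "-", NA)), n_markers), stringsAsFactors = FALSE)
observed <- table(rowSums(!is.na(grid)))
observed
```

The k = 0 entry is the single all-NA row (row 243 above), which the question's expected output does not include.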

score 3 · Accepted Answer · answered Oct 27 '20 at 09:15

I'm not completely certain I've understood what you're looking for, but from the second question it looks like you are looking for all cross-combinations of columns within a data.frame.

Disclaimer: The two answers already provided are more readable, whereas I focus on speed.

What you are performing is often known as a cross join (a Cartesian product), and one aspect that quickly becomes a concern as n increases is efficiency. For efficiency it helps to split the problem into smaller sub-problems and find a solution for each of them. As we need to find all unique combinations within the set of columns, including the null set (value = NA), we can reduce this problem to 2 sub-problems:

1. Find the unique values for each column, including the null set.
2. Expand this set to include all combinations of each column.

Using this idea we can quickly concoct a simple solution using expand.grid, unique and lapply. The only tricky part is including the null set, but we can do this by prepending an NA row to the data.frame: indexing its rows with c(NA, seq_len(nrow(df))) returns a row of NAs followed by all the original rows.

# Create null-set-included data.frame
nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]
# Find all unique elements, including null set
unique_df <- lapply(nullset_df, unique)
# Combine all unique sets
expand.grid(unique_df)

or as a function

nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
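
A small, hedged illustration of the NA-row trick and of the `...` pass-through follows. `small_df`, `with_na`, and `res` are made-up names, and `stringsAsFactors = FALSE` is passed only to show that extra arguments reach `expand.grid` (by default it returns factor columns).

```r
nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)

# Hypothetical toy input: column B has only one distinct value
small_df <- data.frame(A = c("+", "-"), B = c("+", "+"), stringsAsFactors = FALSE)

# Indexing rows with NA yields a row of NAs, followed by the original rows
with_na <- small_df[c(NA, seq_len(nrow(small_df))), ]
with_na

# Unique values per column are (NA, "+", "-") and (NA, "+"), giving 3 * 2 rows
res <- nullgrid.expand(small_df, stringsAsFactors = FALSE)
res
```

Note that, like the other column-wise approaches, this still includes the all-NA row in the result.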

This is fairly fast (benchmarks and performance graphs later), but I wanted to go one step further. The data.table package is known for its high-performance functions, and one of those is the CJ function, for performing cross joins. Below is one implementation using CJ:

library(data.table)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))
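
As a rough check that the CJ-based version produces the same set of rows as the expand.grid-based one, the sketch below compares both on a toy input. It assumes data.table is installed; `small_df`, `cols`, and the `key` helper are illustrative names, and rows are compared as sorted strings because the two functions order their output differently.

```r
library(data.table)

# Hypothetical toy input
small_df <- data.frame(A = c("+", "-"), B = c("+", "+"), stringsAsFactors = FALSE)

# Unique values per column, with NA prepended via the NA-row trick
cols <- lapply(small_df[c(NA, seq_len(nrow(small_df))), ], unique)

# Cross join with data.table::CJ and with base expand.grid
dt_res <- do.call(CJ, c(cols, sorted = FALSE, unique = TRUE))
eg_res <- expand.grid(cols, stringsAsFactors = FALSE)

# Collapse each row to one string so the row sets can be compared regardless of order
key <- function(d) sort(do.call(paste, c(as.list(d), sep = "|")))
identical(key(dt_res), key(eg_res))
```

Both results contain the same 6 rows here; only the row order and the return class (data.table versus data.frame) differ.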

The function requires vector input, forcing one to use do.call (or similar) which makes the function slightly less readable. But is there any performance gain? To test this, I ran a microbenchmark on the two functions and the ones provided by the existing answers (code below), the result is visualized in a violin plot below:

From the plot it is quite clear that @paul's answer outperforms @ekoam's answer, and that the two functions above both outperform the provided answers in terms of speed. But the question said that the input might have any number of dimensions, so there is also the question of how well each function scales with the number of columns and the number of unique values (here we only have "+" and "-", but what if we had more?). For this I reran the benchmark for n_columns = 3, 4, ..., 10 and n_values = 2, 4, ..., 10. The two results are visualized with smooth curves below.

First we'll visualize the time as a function of the number of columns. Note that the y axis is on a logarithmic scale (base 10) for easier comparison.

From the visualization it is quite clear that, with increasing number of columns, the choice of method becomes very important. The suggestion by @ekoam becomes very slow, primarily because it delays a call to unique till the very end. The remaining 3 methods are all much faster, while nullgrid.expand.dt becomes more than 10 times faster in comparison to the remaining methods once we get more than 8 columns of data.

Next, let's look at the timing as a function of the number of unique values in each column (number of columns fixed at 5).

Again we see a similar picture. Except for a single possible outlier for nullgrid.expand, which seems to become slower than paul's answer as the number of unique values increases, nullgrid.expand.dt remains the fastest. Here, though, the gap is smaller: reading roughly 4 versus 3.6 off the base-10 log axis, it saves about 1 - 10^(3.6 - 4) ≈ 60% of the run time (about 2.5 times as fast) compared to paul's answer by the time we reach 10 unique values.

Please note that I did not have enough RAM to run the benchmark for number of unique values or columns greater than the ones shown.
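
The memory limit is easy to rationalize with a back-of-the-envelope count. The sketch below is not part of the benchmark; `grid_rows` and `rowwise_rows` are hypothetical helpers, and it assumes (as in the timing function further down) that the input is the full grid of n_values^n_columns rows.

```r
# Rows in the final cross join: each column contributes its values plus NA
grid_rows <- function(n_columns, n_values) (n_values + 1)^n_columns

# Rows built by the row-wise approach before unique():
# every input row is expanded into 2^n_columns - 1 rows
rowwise_rows <- function(n_columns, n_values) n_values^n_columns * (2^n_columns - 1)

grid_rows(3:10, 2)           # growth with the number of columns
grid_rows(5, seq(2, 10, 2))  # growth with the number of unique values
grid_rows(10, 2)             # 59049
grid_rows(5, 10)             # 161051
rowwise_rows(10, 2)          # 1047552
```

The output grows exponentially in the number of columns, and for 10 columns the row-wise approach materializes over a million intermediate rows to end up with fewer than 60,000 unique ones, which fits the timings described above.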

Conclusion

We've seen that there are many ways to reach the answer sought by the question, but as the number of columns and unique values increases, the choice of method becomes more and more important. By utilizing optimized libraries, we can drastically reduce the time required to get the cross join of all column values with only minimal effort. With extended effort using Rcpp we could likely reduce the run time even further, but this is outside the scope of my answer.

Benchmark code

# Setup:
set.seed(1234)
library(tidyverse)
library(data.table)
library(microbenchmark)  # provides microbenchmark() and its autoplot() method
nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))
markers=LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+","-")), stringsAsFactors=FALSE)
names(plusminus_df)=LETTERS[1:5]

bm <- microbenchmark(
  nullgrid.expand = nullgrid.expand(plusminus_df),
  nullgrid.expand.dt = nullgrid.expand.dt(plusminus_df),
  ekoam = unique(bind_rows(apply(
    plusminus_df, 1L, 
    function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
  ))),
  paul = {
    plusminus_df %>%
      add_row() %>%
      map(unique) %>%
      expand.grid()
  }, 
  control = list(warmup = 5)
)
library(ggplot2)
autoplot(bm) + ggtitle('comparison between cross-join')

Timing function

time_function <- function(n = 5, j = 2){
  idx <- seq_len(n)
  df <- do.call(CJ, args = c(lapply(idx, function(x) as.character(seq_len(j))),
                             sorted = FALSE,
                             unique = TRUE))
  names(df) <- as.character(idx)
  microbenchmark(
    nullgrid.expand = nullgrid.expand(df),
    nullgrid.expand.dt = nullgrid.expand.dt(df),
    ekoam = unique(bind_rows(apply(
      df, 1L, 
      function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
    ))),
    paul = {
      df %>%
        add_row() %>%
        map(unique) %>%
        expand.grid()
    }, 
    times = 10,
    control = list(warmup = 5)
  )
}
res <- lapply(seq(3, 10), time_function)
for(i in seq_along(res)){
  res[[i]]$n <- seq(3, 10)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) + 
  geom_smooth(se = FALSE) + 
  ggtitle('time-comparison given number of columns') + 
  labs(y = 'log(ms)', x = 'n')
ggsave('so_2.png')

res <- lapply(c(seq(2, 10, 2)), time_function, n = 5)
for(i in seq_along(res)){
  res[[i]]$n <- seq(2, 10, 2)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) + 
  geom_smooth(se = FALSE) + 
  ggtitle('time-comparison given number of unique values') + 
  labs(y = 'log(ms)', x = 'n unique values per column')
ggsave('so_3.png')

This is great. But I think my answer is slow because I apply `expand.grid` to every row in the dataframe, not because I delay the `unique` function call. Also, there is no need to call `unique` earlier because rows are always unique in the resultant dataframe for each `expand.grid` operation (but may not be unique across operations). — ekoam, Oct 27 '20 at 10:38
I could've been clearer. By "unique after `expand.grid`" I meant that the answer provided by Paul converts the `data.frame` to a list and calls `expand.grid` on the unique elements of each column, whereas, exactly as you say, you apply `expand.grid` to every row of the `data.frame`. :-) — Oliver, Oct 27 '20 at 10:43
I'd also like to say that there is nothing wrong with either answer. If the data is of a size that either solution can handle, the only reason to "care" would be if one had to call the function repeatedly. — Oliver, Oct 27 '20 at 10:49
