Making Combinations of Items

Question

Suppose I have the following lists of factors:

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

I made the following data frame, which contains all 3 * 2 * 4 * 3 = 72 combinations of these factors:

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4) 
data_exp$id = 1:nrow(data_exp)

> head(data_exp)
  Var1 Var2 Var3 Var4 id
1   A1   B1   C1   D1  1
2   A2   B1   C1   D1  2
3   A3   B1   C1   D1  3
4   A1   B2   C1   D1  4
5   A2   B2   C1   D1  5
6   A3   B2   C1   D1  6

I want to randomly split this data (data_exp) into 3 datasets such that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.

First, I generate 3 random numbers that give the number of rows in each dataset, such that they add up to 72:

# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r

rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos  <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i]  <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}

r = rand_vect(3, 72)
[1] 26 23 23

Next, I tried to create these datasets using these random numbers:

data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]

The problem with this approach is that data_1, data_2 and data_3 can share rows, because each call to sample draws from all 72 rows independently, and not every row of data_exp is necessarily present in one of them.

Is there a way to fix this problem?

Thank you!

`split(data_exp, sample(gl(3,72/3,72)))` should work. Note that your data already has unique rows. There will be no duplication anywhere — Onyambu, Apr 20 '22 at 05:00
Thank you! But this makes all 3 datasets with an equal number of rows. Is there a way to adapt your code so to make all 3 datasets with a random number of rows? Thanks! — stats_noob, Apr 20 '22 at 15:14
Have also a look at [Randomly sample data frame into 3 groups in R](https://stackoverflow.com/q/34028371/10488504) which will in addition allow to give different shares of the 3 groups. — GKi, Apr 25 '22 at 10:23

score 7 · Answer 1 · answered Apr 20 '22 at 15:19

Use the following:

 n <- 3
 split(data_exp, sample(rep(seq(n), rmultinom(1, nrow(data_exp), rep(1, n)))))
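
To make the one-liner easier to follow, here is a step-by-step sketch of the same idea. The intermediate names (sizes, labels, parts) and the seed are my own, chosen only for illustration; the grid is rebuilt so the block runs on its own.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 3

# 1. Random group sizes that always sum to nrow(data_exp);
#    rep(1, n) gives every group the same probability.
sizes <- as.vector(rmultinom(1, nrow(data_exp), rep(1, n)))

# 2. One label per row: sizes[1] ones, sizes[2] twos, and so on.
labels <- rep(seq_len(n), sizes)

# 3. Shuffle the labels so each row lands in a random group, then split.
parts <- split(data_exp, sample(labels))

sapply(parts, nrow)  # group sizes; they add up to 72
```

Because split partitions the rows by label, every row ends up in exactly one piece.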

answered Apr 20 '22 at 15:19

Onyambu

Thank you! But is it possible to adapt the code I used to make this work? Thank you so much! – stats_noob Apr 22 '22 at 02:48
@antonoyaro8 in that case you could do `split(data_exp, sample(rep(seq(3), rand_vect(3, 72))))` – Onyambu Apr 22 '22 at 05:06
Brilliant `rmultinom` approach! +1! – ThomasIsCoding Apr 22 '22 at 08:49
Is there an advantage using `sample(rep(seq(n), rmultinom(1, nrow(data_exp), rep(1, n))))` instead of `sample(3, nrow(data_exp), TRUE)`? – GKi Apr 25 '22 at 09:55
`rmultinom` can return 0 for a group, in which case we end up with fewer than 3 datasets. – M-- Apr 27 '22 at 19:31
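
If an empty group is a concern, one possible workaround (a sketch, not part of the answer) is to reserve one row for each group and let rmultinom distribute only the remaining rows. This assumes the data has at least n rows, and it changes the size distribution slightly because every size is shifted up by one. The names N, sizes and parts are illustrative.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 3
N <- nrow(data_exp)

# Reserve one row per group, then distribute the remaining N - n rows
# at random; every size is at least 1 and the sizes still sum to N.
sizes <- 1 + as.vector(rmultinom(1, N - n, rep(1, n)))

parts <- split(data_exp, sample(rep(seq_len(n), sizes)))
length(parts)  # always n, because no group can be empty
```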

ThomasIsCoding · Answer 2 · 2022-04-28T13:13:57.773

4

You can try the code below (given n <- 3)

split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE)))

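or, if you want each dataset to be a contiguous block of rows in the original order (only the sizes are random):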

split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE))))

If you want to generate variables in the global environment, you should use list2env:

list2env(
    split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE))),
    envir = .GlobalEnv
)

or

list2env(
    split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE)))),
    envir = .GlobalEnv
)

edited Apr 28 '22 at 13:13

answered Apr 22 '22 at 08:23

ThomasIsCoding

Thank you! And what if "n" is not equal to 3 ... would this code still work? – stats_noob Apr 22 '22 at 14:34
@antonoyaro8 yes, I think it should work – ThomasIsCoding Apr 22 '22 at 14:47
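
For a general n, a small wrapper along these lines might help. Note that split_random is a hypothetical helper name, not something from the answer. The sketch fixes the factor levels so that the result always has n elements (a group that happens to receive no rows becomes a 0-row data frame) and so that data_10 sorts after data_9.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

# Hypothetical helper: split df into n named pieces of random size.
split_random <- function(df, n) {
  groups <- sample(n, nrow(df), replace = TRUE)
  # Explicit levels keep a slot for every group and fix their order.
  groups <- factor(paste0("data_", groups),
                   levels = paste0("data_", seq_len(n)))
  split(df, groups)
}

set.seed(2)  # arbitrary seed, only for reproducibility
parts <- split_random(data_exp, 5)
names(parts)
sapply(parts, nrow)
```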

Quinten · Answer 3 · 2022-04-25T10:44:40.273

You can use the following code, which splits the data into three data frames whose sizes come from your r and assigns them the corresponding names:

data <- setNames(split(data_exp, sample(rep(1:3, r))), c("data_1","data_2","data_3"))

When checking the data by number of rows:

> nrow(data$data_1)
[1] 26
> nrow(data$data_2)
[1] 23
> nrow(data$data_3)
[1] 23

Let's check the dataframes:

> data$data_1
   Var1 Var2 Var3 Var4 id
3    A3   B1   C1   D1  3
8    A2   B1   C2   D1  8
10   A1   B2   C2   D1 10
11   A2   B2   C2   D1 11
12   A3   B2   C2   D1 12
13   A1   B1   C3   D1 13
17   A2   B2   C3   D1 17
20   A2   B1   C4   D1 20
26   A2   B1   C1   D2 26
27   A3   B1   C1   D2 27
28   A1   B2   C1   D2 28
31   A1   B1   C2   D2 31
35   A2   B2   C2   D2 35
36   A3   B2   C2   D2 36
38   A2   B1   C3   D2 38
39   A3   B1   C3   D2 39
40   A1   B2   C3   D2 40
45   A3   B1   C4   D2 45
48   A3   B2   C4   D2 48
49   A1   B1   C1   D3 49
50   A2   B1   C1   D3 50
52   A1   B2   C1   D3 52
54   A3   B2   C1   D3 54
56   A2   B1   C2   D3 56
68   A2   B1   C4   D3 68
70   A1   B2   C4   D3 70
> data$data_2
   Var1 Var2 Var3 Var4 id
1    A1   B1   C1   D1  1
4    A1   B2   C1   D1  4
7    A1   B1   C2   D1  7
9    A3   B1   C2   D1  9
15   A3   B1   C3   D1 15
23   A2   B2   C4   D1 23
25   A1   B1   C1   D2 25
29   A2   B2   C1   D2 29
37   A1   B1   C3   D2 37
41   A2   B2   C3   D2 41
42   A3   B2   C3   D2 42
44   A2   B1   C4   D2 44
51   A3   B1   C1   D3 51
53   A2   B2   C1   D3 53
55   A1   B1   C2   D3 55
57   A3   B1   C2   D3 57
60   A3   B2   C2   D3 60
61   A1   B1   C3   D3 61
63   A3   B1   C3   D3 63
64   A1   B2   C3   D3 64
66   A3   B2   C3   D3 66
69   A3   B1   C4   D3 69
71   A2   B2   C4   D3 71
> data$data_3
   Var1 Var2 Var3 Var4 id
2    A2   B1   C1   D1  2
5    A2   B2   C1   D1  5
6    A3   B2   C1   D1  6
14   A2   B1   C3   D1 14
16   A1   B2   C3   D1 16
18   A3   B2   C3   D1 18
19   A1   B1   C4   D1 19
21   A3   B1   C4   D1 21
22   A1   B2   C4   D1 22
24   A3   B2   C4   D1 24
30   A3   B2   C1   D2 30
32   A2   B1   C2   D2 32
33   A3   B1   C2   D2 33
34   A1   B2   C2   D2 34
43   A1   B1   C4   D2 43
46   A1   B2   C4   D2 46
47   A2   B2   C4   D2 47
58   A1   B2   C2   D3 58
59   A2   B2   C2   D3 59
62   A2   B1   C3   D3 62
65   A2   B2   C3   D3 65
67   A1   B1   C4   D3 67
72   A3   B2   C4   D3 72

As you can see, each row appears in exactly one of the three data frames, and the number of rows in each data frame comes from your function.
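
Rather than checking the printed rows by eye, the same property can be verified programmatically. This is a minimal sketch: the sizes in r are the example values from the question's output, and the seed and the variable ids are my own.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(123)       # arbitrary seed, only for reproducibility
r <- c(26, 23, 23)  # example sizes, as printed in the question
data <- setNames(split(data_exp, sample(rep(1:3, r))),
                 c("data_1", "data_2", "data_3"))

# Collect the ids of all pieces and compare them with the full set.
ids <- unlist(lapply(data, function(d) d$id), use.names = FALSE)
anyDuplicated(ids) == 0     # TRUE: no row appears in two pieces
setequal(ids, data_exp$id)  # TRUE: every row appears somewhere
```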

Edit

Great suggestion by @GKi. You can also use the following code:

setNames(split(data_exp, sample(3, nrow(data_exp), TRUE)), c("data_1","data_2","data_3"))

Maybe use instead of `sample(rep(1:3, r))` `sample(rep_len(1:3, nrow(data_exp)))` or better `sample(3, nrow(data_exp), TRUE)`. — GKi, Apr 25 '22 at 09:20
@antonoyaro8, Edited my answer with great suggestion by GKi! It does the same with less code. — Quinten, Apr 25 '22 at 10:46

Andrew Gustar · Answer 4 · 2022-04-22T16:56:48.563

A simple way to do it would be

data_list <- split(data_exp, 
                   sample(rep(c(1:3), 
                              diff(c(0, sort(sample(72, 2)), 72)))))

This generates three numbers (say, x, y and z) that add to 72 (third line). The rep generates a sequence of x 1s, y 2s and z 3s, that are interpreted as the factor levels for splitting the dataframe. sample shuffles this sequence of factors.

The output is a list of three dataframes, which you can easily convert to three separate dataframes with data_1 <- data_list[[1]] etc.

Note that this method of generating three numbers that add up to 72 produces quite wide variation between runs, compared with the other answers here that use something along the lines of sample(1:3, 72, replace = TRUE), which tends to produce about 24 of each number, roughly normally distributed. You should think about what sort of distribution you need. The following simulations illustrate the difference...

summary(as.vector(sapply(1:1000, function(i) diff(c(0, sort(sample(72, 2)), 72)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      10      21      24      36      71 

summary(as.vector(sapply(1:1000, function(i) tabulate(sample(3, 72, replace = TRUE)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     11      21      24      24      27      37
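
The minimum of 0 in the first summary shows that the cut-point method can produce an empty group (when one of the two cut points is 72). If that is not wanted, a possible variation, sketched here with illustrative names (N, n, cuts, sizes), is to draw the cut points from 1 to N - 1 only:

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(7)  # arbitrary seed, only for reproducibility
N <- nrow(data_exp)
n <- 3

# Choose n - 1 distinct cut points from 1..(N - 1); the gaps between
# consecutive cut points (with 0 and N as the ends) are the group sizes,
# and each gap is at least 1.
cuts  <- sort(sample(N - 1, n - 1))
sizes <- diff(c(0, cuts, N))

data_list <- split(data_exp, sample(rep(seq_len(n), sizes)))
sapply(data_list, nrow)
```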

Abdur Rohman · Answer 5 · 2022-04-22T17:17:27.883

Here is another way of randomly splitting the data without sampling by modifying your code:

Generate N random numbers, where N is the number of rows of data_exp, and assign their ordering to an object. This object contains the row numbers of data_exp in random order.

set.seed(1) # This is set to enable reproduction of the result. 
rand = runif(nrow(data_exp))
orders = order(rand)
orders
# [1] 27 47 10 55 69 56 38 24 12 34  5 11 22 54 66  1 25 62 57 64 71
#[22] 30  2 19 28 14 26 60 40 53 63 48 51 67 31 33 16 58 45 44  3 32
#[43]  9 42 65 23  8 59 36 13 50 17 39 49 68 15 20 43 46 37 41 35 72
#[64] 52 29 70  6  4 61 21  7 18

Use your rand_vect function to generate three numbers of which the sum is 72.

set.seed(1)
r = rand_vect(3,72)
r
#[1] 24 25 23
ind = cumsum(r)
ind
#[1] 24 49 72

Subset data_exp using orders and ind this way, so that each dataset takes its rows from the random order:

data_1 = data_exp[orders[1:ind[1]], ]
data_2 = data_exp[orders[(ind[1]+1):ind[2]], ]
data_3 = data_exp[orders[(ind[2]+1):ind[3]], ]

Check that the number of rows of data_1, data_2, and data_3 equals the corresponding element of r, and that there are no common rows among them:

nrow(data_1)
#[1] 24
nrow(data_2)
#[1] 25
nrow(data_3)
#[1] 23
merge(data_1, data_2)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
merge(data_1, data_3)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
merge(data_2, data_3)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
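
The same shuffle-then-cut idea can also be written without explicit index ranges, by labelling the shuffled positions and calling split. This is only a sketch: r is hard-coded to the example sizes printed in this answer, and grp and parts are illustrative names.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)
orders <- order(runif(nrow(data_exp)))  # random permutation of row numbers
r <- c(24, 25, 23)                      # example sizes printed in this answer

# Label the shuffled positions: the first r[1] go to group 1, and so on.
grp <- rep(seq_along(r), r)
parts <- split(data_exp[orders, ], grp)
names(parts) <- paste0("data_", seq_along(r))

sapply(parts, nrow)
```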

Here is a function that implements the steps mentioned above:

rand_split = function(yourdf, nsplit, seed = 1){
  # 1. Generate a random order of the rows.
  set.seed(seed)
  rand = runif(nrow(yourdf))
  orders = order(rand)

  # 2. Set the number of parts into which yourdf will be split.
  #    Each part has a random number of rows, stored in r.
  r = rand_vect(nsplit, nrow(yourdf))

  # 3. Generate the indices used to select from the random row order.
  ind = cumsum(r)
  n   = length(ind)
  k   = c(1, head(ind, -1) + 1)  # start position of each part
  m   = Map(seq, k, ind)

  # 4. Split yourdf randomly, store the resulting parts in a list,
  #    and name the elements data_1, data_2, and so on.
  #    Index yourdf itself so the function works on any data frame.
  outdf = lapply(m, function(x) yourdf[orders[x], ])
  names(outdf) = paste0("data_", seq_len(n))
  output = c(nrow = r, outdf)
  return(output)
}

This function can accept any nsplit and seed number. Example:

rand_split(data_exp, 5, 2)

$nrow1
[1] 13

$nrow2
[1] 14

$nrow3
[1] 15

$nrow4
[1] 15

$nrow5
[1] 15

$data_1
   Var1 Var2 Var3 Var4 id
51   A3   B1   C1   D3 51
31   A1   B1   C2   D2 31
52   A1   B2   C1   D3 52
70   A1   B2   C4   D3 70
20   A2   B1   C4   D1 20
43   A1   B1   C4   D2 43
7    A1   B1   C2   D1  7
30   A3   B2   C1   D2 30
27   A3   B1   C1   D2 27
40   A1   B2   C3   D2 40
24   A3   B2   C4   D1 24
44   A2   B1   C4   D2 44
32   A2   B1   C2   D2 32


# and more...

denis · Answer 6 · 2022-04-22T15:29:57.760

2

Keeping your original idea (which is a nice idea, although I think using split as proposed by KU99 is the simplest and most efficient way to proceed):

The only missing step in your code is to remove the rows you have already sampled, so that you do not sample them again later.

Using the code you gave, you could do:

r = rand_vect(3, 72)

rowlist = 1:nrow(data_exp)

lapply(seq(r),function(i){

  chosen_rows <- sample(rowlist,r[i])  
  rowlist <<- rowlist[!rowlist %in% chosen_rows]
  data_exp[chosen_rows,]
})

This gives you the list of your 3 subsets.

Here, for each subset, I remove the chosen rows from the rows still available for the next subset (rowlist <<- rowlist[!rowlist %in% chosen_rows]). This way, you do not sample rows that were already sampled.
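
One caveat, with a possible variation: sample(x, k) treats a length-one numeric x as 1:x, so if only a single row is left in rowlist the last draw could pick an unintended row. The sketch below avoids this by sampling positions with sample.int, and uses a plain loop instead of <<-. The sizes in r are example values standing in for rand_vect(3, 72), and remaining, subsets and chosen are illustrative names.

```r
# Rebuild the 72-row grid from the question so the sketch is self-contained.
data_exp <- expand.grid(
  c("A1", "A2", "A3"), c("B1", "B2"),
  c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3")
)
data_exp$id <- seq_len(nrow(data_exp))

set.seed(99)        # arbitrary seed, only for reproducibility
r <- c(26, 23, 23)  # example sizes; the answer takes these from rand_vect(3, 72)

remaining <- seq_len(nrow(data_exp))
subsets <- vector("list", length(r))
for (i in seq_along(r)) {
  # Sample positions within `remaining`, then map them back to row numbers.
  chosen <- remaining[sample.int(length(remaining), r[i])]
  remaining <- setdiff(remaining, chosen)  # drop rows already used
  subsets[[i]] <- data_exp[chosen, ]
}

sapply(subsets, nrow)
```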

edited Apr 22 '22 at 15:29

answered Apr 22 '22 at 13:34

denis

Thank you so much! Just a question - would your code also work if n =4 ? (or any value of n)? Thank you so much! – stats_noob Apr 22 '22 at 14:34
Yes, it should work for any value, as your function rand_vect provides `n` values – denis Apr 22 '22 at 15:28
