
Suppose I have the following vectors of factor levels:

factor_1 = c("A1", "A2", "A3")
factor_2 = c("B1", "B2")
factor_3 = c("C1", "C2", "C3", "C4")
factor_4 = c("D1", "D2", "D3")

I made the following data frame that contains all (3 * 2 * 4 * 3 = ) 72 combinations of these factors:

data_exp <- expand.grid(factor_1, factor_2, factor_3, factor_4) 
data_exp$id = 1:nrow(data_exp)

> head(data_exp)
  Var1 Var2 Var3 Var4 id
1   A1   B1   C1   D1  1
2   A2   B1   C1   D1  2
3   A3   B1   C1   D1  3
4   A1   B2   C1   D1  4
5   A2   B2   C1   D1  5
6   A3   B2   C1   D1  6

I want to randomly split data_exp into 3 datasets so that each row appears in exactly one of them. The 3 datasets do not have to be the same size. I tried to do this with the following code.

First, I generate 3 random integers that add up to 72, one for the number of rows in each dataset:

# https://stackoverflow.com/questions/24845909/generate-n-random-integers-that-sum-to-m-in-r

rand_vect <- function(N, M, sd = 1, pos.only = TRUE) {
  vec <- rnorm(N, M/N, sd)
  if (abs(sum(vec)) < 0.01) vec <- vec + 1
  vec <- round(vec / sum(vec) * M)
  deviation <- M - sum(vec)
  for (. in seq_len(abs(deviation))) {
    vec[i] <- vec[i <- sample(N, 1)] + sign(deviation)
  }
  if (pos.only) while (any(vec < 0)) {
    negs <- vec < 0
    pos  <- vec > 0
    vec[negs][i] <- vec[negs][i <- sample(sum(negs), 1)] + 1
    vec[pos][i]  <- vec[pos ][i <- sample(sum(pos ), 1)] - 1
  }
  vec
}

r = rand_vect(3, 72)
[1] 26 23 23

Next, I tried to create these datasets using these random numbers:

data_1 = data_exp[sample(nrow(data_exp), r[1]), ]
data_2 = data_exp[sample(nrow(data_exp), r[2]), ]
data_3 = data_exp[sample(nrow(data_exp), r[3]), ]

The problem with this approach is that data_1, data_2 and data_3 can share rows, and not every row of data_exp is guaranteed to appear in one of them.

Is there a way to fix this problem?

Thank you!

stats_noob
  • `split(data_exp, sample(gl(3,72/3,72)))` should work. Note that your data already has unique rows. There will be no duplication anywhere – Onyambu Apr 20 '22 at 05:00
  • Thank you! But this makes all 3 datasets with an equal number of rows. Is there a way to adapt your code so to make all 3 datasets with a random number of rows? Thanks! – stats_noob Apr 20 '22 at 15:14
  • Have also a look at [Randomly sample data frame into 3 groups in R](https://stackoverflow.com/q/34028371/10488504) which will in addition allow to give different shares of the 3 groups. – GKi Apr 25 '22 at 10:23

6 Answers


Use the following:

 n <- 3
 split(data_exp, sample(rep(seq(n), rmultinom(1, nrow(data_exp), rep(1, n)))))
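A rough step-by-step sketch of what this one-liner does, assuming the data_exp from the question: rmultinom draws three group sizes that add up to the number of rows, rep repeats each group label that many times, and sample shuffles the labels before split uses them. The seed and the names sizes, labels and parts are only for illustration.

```r
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 3
# random group sizes that sum to nrow(data_exp)
sizes  <- as.vector(rmultinom(1, nrow(data_exp), rep(1, n)))
# one label per row, shuffled so rows are assigned at random
labels <- sample(rep(seq(n), sizes))
parts  <- split(data_exp, labels)

sapply(parts, nrow)  # the group sizes; they add up to nrow(data_exp)
```

Because every row gets exactly one label, no row can end up in two groups.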
Onyambu

You can try the code below (given n <- 3)

split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE)))

or

split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE))))

If you want to generate variables in the global environment, you should use list2env:

list2env(
    split(data_exp, paste0("data_", sample(n, nrow(data_exp), replace = TRUE))),
    envir = .GlobalEnv
)

or

list2env(
    split(data_exp, paste0("data_", sort(sample(n, nrow(data_exp), replace = TRUE)))),
    envir = .GlobalEnv
)
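To see how the two variants differ, here is a small sketch assuming the data_exp from the question (the seed and the names g, shuffled and blocked are illustrative). Without sort, rows are assigned to groups at random. With sort, the group sizes are the same but each group is a block of consecutive rows, so only the sizes are random.

```r
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 3
g <- sample(n, nrow(data_exp), replace = TRUE)

shuffled <- split(data_exp, paste0("data_", g))        # rows assigned at random
blocked  <- split(data_exp, paste0("data_", sort(g)))  # same sizes, consecutive rows

sapply(shuffled, nrow)
sapply(blocked, nrow)
range(blocked$data_1$id)  # starts at 1: data_1 holds the first rows of data_exp
```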

ThomasIsCoding

You can use the following code, which splits the data into three data frames whose sizes come from your r and names them data_1, data_2 and data_3:

data <- setNames(split(data_exp, sample(rep(1:3, r))), c("data_1","data_2","data_3"))

Checking the number of rows in each data frame:

> nrow(data$data_1)
[1] 26
> nrow(data$data_2)
[1] 23
> nrow(data$data_3)
[1] 23

Let's check the dataframes:

> data$data_1
   Var1 Var2 Var3 Var4 id
3    A3   B1   C1   D1  3
8    A2   B1   C2   D1  8
10   A1   B2   C2   D1 10
11   A2   B2   C2   D1 11
12   A3   B2   C2   D1 12
13   A1   B1   C3   D1 13
17   A2   B2   C3   D1 17
20   A2   B1   C4   D1 20
26   A2   B1   C1   D2 26
27   A3   B1   C1   D2 27
28   A1   B2   C1   D2 28
31   A1   B1   C2   D2 31
35   A2   B2   C2   D2 35
36   A3   B2   C2   D2 36
38   A2   B1   C3   D2 38
39   A3   B1   C3   D2 39
40   A1   B2   C3   D2 40
45   A3   B1   C4   D2 45
48   A3   B2   C4   D2 48
49   A1   B1   C1   D3 49
50   A2   B1   C1   D3 50
52   A1   B2   C1   D3 52
54   A3   B2   C1   D3 54
56   A2   B1   C2   D3 56
68   A2   B1   C4   D3 68
70   A1   B2   C4   D3 70
> data$data_2
   Var1 Var2 Var3 Var4 id
1    A1   B1   C1   D1  1
4    A1   B2   C1   D1  4
7    A1   B1   C2   D1  7
9    A3   B1   C2   D1  9
15   A3   B1   C3   D1 15
23   A2   B2   C4   D1 23
25   A1   B1   C1   D2 25
29   A2   B2   C1   D2 29
37   A1   B1   C3   D2 37
41   A2   B2   C3   D2 41
42   A3   B2   C3   D2 42
44   A2   B1   C4   D2 44
51   A3   B1   C1   D3 51
53   A2   B2   C1   D3 53
55   A1   B1   C2   D3 55
57   A3   B1   C2   D3 57
60   A3   B2   C2   D3 60
61   A1   B1   C3   D3 61
63   A3   B1   C3   D3 63
64   A1   B2   C3   D3 64
66   A3   B2   C3   D3 66
69   A3   B1   C4   D3 69
71   A2   B2   C4   D3 71
> data$data_3
   Var1 Var2 Var3 Var4 id
2    A2   B1   C1   D1  2
5    A2   B2   C1   D1  5
6    A3   B2   C1   D1  6
14   A2   B1   C3   D1 14
16   A1   B2   C3   D1 16
18   A3   B2   C3   D1 18
19   A1   B1   C4   D1 19
21   A3   B1   C4   D1 21
22   A1   B2   C4   D1 22
24   A3   B2   C4   D1 24
30   A3   B2   C1   D2 30
32   A2   B1   C2   D2 32
33   A3   B1   C2   D2 33
34   A1   B2   C2   D2 34
43   A1   B1   C4   D2 43
46   A1   B2   C4   D2 46
47   A2   B2   C4   D2 47
58   A1   B2   C2   D3 58
59   A2   B2   C2   D3 59
62   A2   B1   C3   D3 62
65   A2   B2   C3   D3 65
67   A1   B1   C4   D3 67
72   A3   B2   C4   D3 72

As you can see, each row appears in exactly one of the three data frames, and the sizes of the data frames match the values in r from your function.
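Instead of checking by eye, the id column can be used to confirm this. The sketch below assumes the data_exp from the question and uses a fixed r as a stand-in for the output of rand_vect(3, 72); the seed and the name ids are illustrative.

```r
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

r <- c(26, 23, 23)  # stand-in for the output of rand_vect(3, 72)
set.seed(123)       # arbitrary seed, only for reproducibility
data <- setNames(split(data_exp, sample(rep(1:3, r))),
                 c("data_1", "data_2", "data_3"))

ids <- unlist(lapply(data, function(d) d$id))
anyDuplicated(ids) == 0     # TRUE if no row appears in two data frames
setequal(ids, data_exp$id)  # TRUE if every row of data_exp is used
sapply(data, nrow)          # sizes follow r
```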

Edit

Great suggestion by @GKi. You can also use the following code. Note that here the group sizes are drawn by sample() rather than taken from r:

setNames(split(data_exp, sample(3, nrow(data_exp), TRUE)), c("data_1","data_2","data_3"))
Quinten
  • Maybe use instead of `sample(rep(1:3, r))` `sample(rep_len(1:3, nrow(data_exp)))` or better `sample(3, nrow(data_exp), TRUE)`. – GKi Apr 25 '22 at 09:20
  • @antonoyaro8, Edited my answer with great suggestion by GKi! It does the same with less code. – Quinten Apr 25 '22 at 10:46

A simple way to do it would be

data_list <- split(data_exp, 
                   sample(rep(c(1:3), 
                              diff(c(0, sort(sample(72, 2)), 72)))))

This generates three numbers (say, x, y and z) that add to 72 (third line). The rep generates a sequence of x 1s, y 2s and z 3s, that are interpreted as the factor levels for splitting the dataframe. sample shuffles this sequence of factors.

The output is a list of three dataframes, which you can easily convert to three separate dataframes with data_1 <- data_list[[1]] etc.

Note that this method of generating three numbers adding up to 72 will create quite a wide variation between runs, compared to some of the other answers here that use a method along the lines of sample(1:3, 72, replace = TRUE), which will tend to produce about 24 of each number, roughly normally distributed. You should think about what sort of distribution you need. The following simulations illustrate the difference:

summary(as.vector(sapply(1:1000, function(i) diff(c(0, sort(sample(72, 2)), 72)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0      10      21      24      36      71 

summary(as.vector(sapply(1:1000, function(i) tabulate(sample(3, 72, replace = TRUE)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     11      21      24      24      27      37 
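The minimum of 0 in the first summary means one group can come out empty, in which case split returns only two data frames. If every group must contain at least one row, one possible tweak (a sketch, not part of the method above) is to draw the cut points from 1 to 71 instead of 1 to 72. The names n_rows and sizes and the seed are illustrative.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
n_rows <- 72

# The two cut points are distinct and strictly below n_rows,
# so each of the three differences is at least 1.
sizes <- diff(c(0, sort(sample(n_rows - 1, 2)), n_rows))

sizes
sum(sizes)  # always n_rows
```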
Andrew Gustar

Here is another way to split the data randomly without sample(), by modifying your code:

  1. Generate N random numbers, where N is the number of rows of data_exp, and store their ordering (from order()) in an object. This object contains the row numbers of data_exp in a random order.
set.seed(1) # This is set to enable reproduction of the result. 
rand = runif(nrow(data_exp))
orders = order(rand)
orders
# [1] 27 47 10 55 69 56 38 24 12 34  5 11 22 54 66  1 25 62 57 64 71
#[22] 30  2 19 28 14 26 60 40 53 63 48 51 67 31 33 16 58 45 44  3 32
#[43]  9 42 65 23  8 59 36 13 50 17 39 49 68 15 20 43 46 37 41 35 72
#[64] 52 29 70  6  4 61 21  7 18
  2. Use your rand_vect function to generate three numbers that sum to 72.
set.seed(1)
r = rand_vect(3,72)
r
#[1] 24 25 23
ind = cumsum(r)
ind
#[1] 24 49 72
  3. Subset data_exp using orders and ind this way, so that the rows are taken in the random order:
data_1 = data_exp[orders[1:ind[1]], ]
data_2 = data_exp[orders[(ind[1]+1):ind[2]], ]
data_3 = data_exp[orders[(ind[2]+1):ind[3]], ]

Check that the numbers of rows of data_1, data_2 and data_3 equal the elements of r, and that the three data frames have no rows in common:

nrow(data_1)
#[1] 24
nrow(data_2)
#[1] 25
nrow(data_3)
#[1] 23
merge(data_1, data_2)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
merge(data_1, data_3)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
merge(data_2, data_3)
#[1] Var1 Var2 Var3 Var4 id  
#<0 rows> (or 0-length row.names)
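As a side note, order(runif(n)) yields a random permutation of 1..n, which sample(n) also does. A compact sketch of the same three steps, assuming the data_exp from the question and taking r as the fixed sizes from the run above rather than calling rand_vect (the names starts and parts are illustrative):

```r
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

r <- c(24, 25, 23)  # sizes from the rand_vect(3, 72) run shown above
set.seed(1)         # arbitrary seed, only for reproducibility

orders <- sample(nrow(data_exp))   # random permutation of the row numbers
ind    <- cumsum(r)                # last position of each part
starts <- c(1, head(ind, -1) + 1)  # first position of each part

# take consecutive stretches of the shuffled row numbers
parts <- Map(function(from, to) data_exp[orders[from:to], ], starts, ind)
names(parts) <- paste0("data_", seq_along(parts))

sapply(parts, nrow)
```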

Here is a function that implements the steps mentioned above:

rand_split = function(yourdf, nsplit, seed = 1){
#1. Generate a random order of row

set.seed(seed)

rand = runif(nrow(yourdf))
orders = order(rand)

#2. Set how many parts into which yourdf will be split. Each part has a random number of rows stored in r.

r   = rand_vect(nsplit, nrow(yourdf))

#3. Generate the indices that will be used to select the random order of row 
 
ind = cumsum(r)
n   = length(ind)
k   = c(1, ind[1:(n-1)] +1)
m   = Map(seq, k, ind)

#4. Split yourdf randomly, store the resulting parts in a list, and name the elements data_1, data_2, and so on. 

outdf = lapply(m, function(x) yourdf[orders[x],])
names(outdf) = paste0("data_", seq_len(n))
output = c(nrow = r, outdf)
return(output)
}

This function can accept any nsplit and seed number. Example:

rand_split(data_exp, 5, 2)

$nrow1
[1] 13

$nrow2
[1] 14

$nrow3
[1] 15

$nrow4
[1] 15

$nrow5
[1] 15

$data_1
   Var1 Var2 Var3 Var4 id
51   A3   B1   C1   D3 51
31   A1   B1   C2   D2 31
52   A1   B2   C1   D3 52
70   A1   B2   C4   D3 70
20   A2   B1   C4   D1 20
43   A1   B1   C4   D2 43
7    A1   B1   C2   D1  7
30   A3   B2   C1   D2 30
27   A3   B1   C1   D2 27
40   A1   B2   C3   D2 40
24   A3   B2   C4   D1 24
44   A2   B1   C4   D2 44
32   A2   B1   C2   D2 32


# and more...
Abdur Rohman

Keeping your original idea (which is a nice idea, although using split as proposed by KU99 is the most efficient and simple way of proceeding I think):

The only missing step in your code is to remove the rows you have already sampled, so that they cannot be sampled again.

Using the code you gave, you could do:

r = rand_vect(3, 72)

rowlist = 1:nrow(data_exp)

lapply(seq(r),function(i){

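  # index into rowlist so that a length-one rowlist is not read as 1:x by sample()
  chosen_rows <- rowlist[sample.int(length(rowlist), r[i])]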
  rowlist <<- rowlist[!rowlist %in% chosen_rows]
  data_exp[chosen_rows,]
})

This returns a list of your 3 subsets.

Here, after each subset is drawn, I remove the chosen rows from the rows available for the next subset (rowlist <<- rowlist[!rowlist %in% chosen_rows]). This way, you do not sample rows that were already sampled.
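If you would rather avoid the global assignment with <<-, the same idea can be written as a plain for loop. This is only a sketch: it assumes the data_exp from the question, uses a fixed r as a stand-in for rand_vect(3, 72), and the names remaining, subsets and chosen are illustrative.

```r
data_exp <- expand.grid(c("A1", "A2", "A3"), c("B1", "B2"),
                        c("C1", "C2", "C3", "C4"), c("D1", "D2", "D3"))
data_exp$id <- seq_len(nrow(data_exp))

r <- c(26, 23, 23)  # stand-in for the output of rand_vect(3, 72)
set.seed(99)        # arbitrary seed, only for reproducibility

remaining <- seq_len(nrow(data_exp))  # row numbers not yet assigned
subsets   <- vector("list", length(r))

for (i in seq_along(r)) {
  # draw positions within `remaining`, then map them to row numbers
  chosen       <- remaining[sample.int(length(remaining), r[i])]
  subsets[[i]] <- data_exp[chosen, ]
  remaining    <- setdiff(remaining, chosen)  # drop rows already used
}

sapply(subsets, nrow)
length(remaining)  # 0 when sum(r) equals nrow(data_exp)
```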

denis