Sample a data set of 10,000 rows into unique sets of 100 based on probability of a particular column value

Question

I have a dataset of 20 rows with 4 columns: ID, Name, Age and Type. [simplified data set]

Original data set:

>data
ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C 
11  ABCDDD  23   A
12  CDEDDD  34   A
13  ABCEDDD  23   C
14  CDEYUDDD  34   B 
15  ABCWDDD  23   A
16  CDEDRDDD  34   B 
17  ASERDDD  23   A
18  CDEAWDDD  34   B 
19  ABCHKJDDD  23   A    
20  CDEFDEDDD  34   C

Here the "Type" column is distributed such that the probabilities of A, B and C are 0.5, 0.3 and 0.2 respectively.

Now, I want to cut this into two non-overlapping sets of 10 rows each, so that each set has the same probability distribution of "Type".

Can I use the sample function to achieve this purpose?

Something like this:

sample(data, 10, replace=F, prob((data$Type="A")=0.5,(data$Type="B")=0.3,(data$Type="C")=0.2))

Also, how do I write a loop to get this continuously for a big set of 100 rows? I mean 10 sets from a dataset of 100 rows.

Expected Output:

Dataset 1:

ID Name Age Type
1  ABC  23   A
2  CDE  34   A
3  ABCE  23   C
4  CDEYU  34   B 
5  ABCW  23   A
6  CDEDR  34   B 
7  ASER  23   A
8  CDEAW  34   B 
9  ABCHKJ  23   A
10  CDEFDE  34   C

Dataset 2:

ID Name Age Type
1  ABCDDD  23   A
2  CDEDDD  34   A
3  ABCEDDD  23   C
4  CDEYUDDD  34   B 
5  ABCWDDD  23   A
6  CDEDRDDD  34   B 
7  ASERDDD  23   A
8  CDEAWDDD  34   B 
9  ABCHKJDDD  23   A
10  CDEFDEDDD  34   C

Any help in this regard would be greatly appreciated.

Please read about how to post your example data in [a reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Thomas, Feb 24 '14 at 09:27
Also, I note that you haven't accepted any answers to previous questions you have posted. It is good practice [to accept answers](http://stackoverflow.com/help/accepted-answer) because it helps organize questions on the site and gives you a small (+2) reputation gain. — Thomas, Feb 24 '14 at 09:28
@Thomas: Thanks for the inputs. I have accepted the answers and organised them properly. — N2M, Feb 24 '14 at 09:53
Do you want each subset to have 5 x `A`, 3 x `B`, and 2 x `C`, or do you want the probability of drawing an `A`, `B` or a `C` to be constant across subsets, i.e. perhaps leading to datasets that have varying distributions of `A,B,C`, but which, on average, comprise 5 x `A`, 3 x `B` and 2 x `C`? — jbaums, Feb 24 '14 at 10:01

jbaums · Accepted Answer · 2014-02-24T10:34:32.980

Here is one way to achieve what I believe you intend to do:

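# Example data: 100 rows with 50 x A, 30 x B and 20 x C in random order
d <- data.frame(id=1:100,
                type=sample(unlist(mapply(rep, c('A', 'B', 'C'), 
                                          c(50, 30, 20), USE.NAMES=FALSE))),
                group=NA)

# For each type, hand out group numbers 1 to 10 in random order, 
#  so that every group receives 5 x A, 3 x B and 2 x C
d <- within(d, {
  group[which(type=='A')] <- sample(gl(10, 5))
  group[which(type=='B')] <- sample(gl(10, 3))
  group[which(type=='C')] <- sample(gl(10, 2))
})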


foo <- split(d[, 1:2], d$group) 
# above, adjust 1:2 to reflect which columns you want 
#  to include in the split data.frames.

foo[1:2] # First 2 (of 10) elements

$`1`
     id type
20   20    A
31   31    C
34   34    C
37   37    A
42   42    A
52   52    B
60   60    A
74   74    B
77   77    A
100 100    B

$`2`
   id type
1   1    C
17 17    C
27 27    A
46 46    B
57 57    B
58 58    A
62 62    B
71 71    A
72 72    A
89 89    A

Each element of list `foo` has 5 x `A`, 3 x `B`, and 2 x `C`. This is achieved by identifying the indices corresponding to each type in turn (using `which`), then assigning permuted group numbers 1 through 10 (with the number of repetitions corresponding to your desired distribution). Finally, `split` is used to split the `data.frame` into a list of `data.frame`s.
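To make the "permuted group numbers" step concrete, here is a minimal sketch of what `sample(gl(10, 5))` produces on its own. The seed and the variable name `g` are arbitrary choices for illustration only.

```r
set.seed(1)              # arbitrary seed, only so the sketch is repeatable

# gl(10, 5) is a factor with levels 1..10, each repeated 5 times (50 values).
# sample() shuffles those 50 labels without changing how often each occurs.
g <- sample(gl(10, 5))

length(g)                # 50, one label per row of type A
table(g)                 # every group number appears exactly 5 times
```

Because the shuffle only reorders the labels, every group is guaranteed exactly 5 rows of that type, rather than 5 on average.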

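To generalise this solution to a dataset with 10,000 rows, with 100 rows in each subset, simply adjust the arguments to `gl`. `gl(n, k)` generates `n` group labels, each repeated `k` times, so if there are 5000 `A` in the large dataset, use `group[which(type=='A')] <- sample(gl(100, 50))` to give each of the 100 subsets 50 x `A`. Likewise, 3000 `B` and 2000 `C` would need `gl(100, 30)` and `gl(100, 20)`.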
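If you would rather not hard-code one line per type, the same idea can be wrapped in a small helper. This is only a sketch: the function name `stratified_split`, its arguments and the example data `big` are made up for illustration, and it assumes the count of every type is an exact multiple of the number of groups.

```r
# Hypothetical helper (not part of base R): split df into n_groups subsets
# that all contain the same number of rows of each type.
stratified_split <- function(df, type_col, n_groups) {
  counts <- table(df[[type_col]])
  # Assumption: each type's count divides evenly among the groups
  stopifnot(all(counts %% n_groups == 0))
  # Within each type, shuffle the group labels 1..n_groups (each repeated
  # equally often) and write them back to that type's row positions
  group <- ave(seq_len(nrow(df)), df[[type_col]], FUN = function(idx) {
    sample(rep(seq_len(n_groups), each = length(idx) / n_groups))
  })
  split(df, group)
}

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
big <- data.frame(id = 1:10000,
                  type = sample(rep(c("A", "B", "C"), c(5000, 3000, 2000))))
subsets <- stratified_split(big, "type", 100)

length(subsets)            # 100 subsets
table(subsets[[1]]$type)   # 50 x A, 30 x B, 20 x C
```

Which rows end up in which subset depends on the seed, but the per-subset counts do not.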

Thank you so much. This is the right way to go about it. and yes, this is exactly what I wanted. Learnt about within, which, gl. Cheers :) — N2M, Feb 24 '14 at 11:02
