R- random sample of groups in a data.table

Question

How can I randomly sample, e.g., three groups from a data.table so that the result contains all rows of those three groups from the original data.table?

library(data.table)
dat <- data.table(ids = 1:20,
                  groups = sample(x = c("A", "B", "C", "D", "E", "F"), 20, replace = TRUE))

I know how to select 10 rows randomly from a data.table:

dat.sampl1 <- as.data.table(sapply(dat[], sample, 10))

I also know how to sample rows within each group:

dat[,.SD[sample(.N, min(.N,3))], by = groups]

But how to randomly sample groups? So the result should look like:

I don't understand what you are asking. If you are going to use `sample()`, then use `set.seed()` so your data is reproducible. It seems you have some constraint so it's not a simple random sample. Is this some sort of conditional sampling perhaps? — MrFlick, May 14 '18 at 18:03
Does this answer your question? [Sample random rows within each group in a data.table](https://stackoverflow.com/questions/16289182/sample-random-rows-within-each-group-in-a-data-table) — Union find, Jul 07 '23 at 19:42

Weihuang Wong · Answer 1 · 2018-05-14T18:57:49.903

5

Do you mean something like:

set.seed(123)
dat <- data.table(ids = 1:20,
                  groups = sample(x = c("A", "B", "C", "D", "E", "F"), 20, replace = TRUE))
dat[groups %in% sample(unique(dat[, groups]), size = 3)][order(groups)]
#     ids groups
#  1:   3      C
#  2:  10      C
#  3:  12      C
#  4:   7      D
#  5:   9      D
#  6:  14      D
#  7:   4      F
#  8:   5      F
#  9:   8      F
# 10:  11      F
# 11:  16      F
# 12:  20      F
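
The one-liner above can also be split into two explicit steps, which makes it easier to see that whole groups are drawn first and rows are filtered afterwards. This is a minimal sketch, assuming the same `dat` layout as in the question; the seed and the names `all_groups`, `picked`, and `res` are arbitrary choices for illustration:

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the draw is reproducible
dat <- data.table(ids = 1:20,
                  groups = sample(c("A", "B", "C", "D", "E", "F"), 20, replace = TRUE))

# Step 1: draw group labels, not rows.
# min() guards against asking for more groups than are present in the data.
all_groups <- unique(dat$groups)
picked <- sample(all_groups, size = min(3, length(all_groups)))

# Step 2: keep every row that belongs to one of the drawn groups.
res <- dat[groups %in% picked][order(groups)]

# Row counts per drawn group match the original table.
res[, .N, by = groups]
```

Because the filter is applied to the full table, each selected group keeps all of its original rows.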

If you want to sample groups with replacement, you can do the following, where A has been sampled twice:

dat[unique(dat[, list(groups)])[sample(.N, 3, replace = TRUE)], on = "groups"]
#    ids groups
# 1:   3      C
# 2:  10      C
# 3:  12      C
# 4:   6      A
# 5:  15      A
# 6:  18      A
# 7:   6      A
# 8:  15      A
# 9:  18      A
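
When a group is drawn more than once, its copies are indistinguishable in the output above. One possible variation, sketched here under the assumption that a draw index is useful, is to carry a `draw` column through the join. The column name `draw` and the helper table `draws` are hypothetical, not part of the original answer. `allow.cartesian = TRUE` is set as a precaution, since repeated groups in the lookup table can push the result past data.table's default row limit for joins:

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the draw is reproducible
dat <- data.table(ids = 1:20,
                  groups = sample(c("A", "B", "C", "D", "E", "F"), 20, replace = TRUE))

# One row per draw; the same group may appear several times.
draws <- data.table(draw = 1:3,
                    groups = sample(unique(dat$groups), 3, replace = TRUE))

# For every draw, pull all rows of the drawn group from dat.
# The non-join column 'draw' from the lookup table is appended to the result.
res <- dat[draws, on = "groups", allow.cartesian = TRUE]

# Rows per draw: a group drawn twice shows up under two draw ids.
res[, .N, by = .(draw, groups)]
```

This keeps the resampled groups separable, which matters if they are later treated as distinct clusters, for example in a cluster bootstrap.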

edited May 14 '18 at 18:57

answered May 14 '18 at 18:32


Perfect! That was exactly what I was looking for. Thanks a lot. – user2147915 May 15 '18 at 06:15
Great - if the answer solved your problem, please accept it so we can mark the question as resolved. – Weihuang Wong May 15 '18 at 11:20

rg255 · Answer 2 · 2018-05-15T06:21:33.357

This can be done in a single line of base R: draw the group labels with sample(), then use %in% to keep only the rows whose group was drawn:

df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = FALSE), ]

For example:

> df1 <- data.frame("ids" = 1:20, "groups" = sample(LETTERS[1:4], size = 20, replace = TRUE))
> df2 <- df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = FALSE), ]
> df2[order(df2[,'groups']),]
   ids groups
4    4      B
6    6      B
18  18      B
20  20      B
1    1      C
2    2      C
3    3      C
9    9      C
12  12      C
16  16      C
19  19      C
7    7      D
11  11      D
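
Note that `df1[, 'groups']` returns a vector for a data.frame but a one-column table for a data.table, so the line above is meant for data.frames. A sketch of a variant that should work for both classes uses `[[` to extract the column. The function name `sample_groups` and its arguments are hypothetical, chosen only for this illustration:

```r
library(data.table)

# Hypothetical helper: keep all rows of n randomly drawn groups.
# x   : a data.frame or data.table
# col : name of the grouping column (character string)
# n   : number of groups to draw without replacement
sample_groups <- function(x, col, n) {
  g <- unique(x[[col]])
  # Index-based sampling avoids sample()'s special case for a single number.
  keep <- g[sample.int(length(g), size = min(n, length(g)))]
  idx <- x[[col]] %in% keep
  x[idx, ]
}

set.seed(1)  # arbitrary seed, only so the draw is reproducible
df1 <- data.frame(ids = 1:20,
                  groups = sample(LETTERS[1:4], size = 20, replace = TRUE))
dt1 <- as.data.table(df1)

res_df <- sample_groups(df1, "groups", 3)  # data.frame in, data.frame out
res_dt <- sample_groups(dt1, "groups", 3)  # data.table in, data.table out
```

The logical index is computed before subsetting, so the same `x[idx, ]` call behaves the same way for both classes.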
