R - stratified sampling for Person Period file

Question

Following up on this question, I would like to know how to draw a stratified sample from a person-period file efficiently.

I have a dataset that looks like this:

    id time var  clust
 1:  1    1   a clust1
 2:  1    2   c clust1
 3:  1    3   c clust1
 4:  2    1   a clust1
 5:  2    2   a clust1
...

The individuals (id) are grouped into clusters (clust). I would like to sample ids within each clust while keeping the person-period format, i.e. keeping every row of each sampled id.

The solution I came up with is to sample one row per cluster and then merge back onto the full data by id. However, it is not a very elegant solution.

library(data.table) 
library(dplyr) 

setDT(dt) 

dt[,.SD[sample(.N,1)],by = clust] %>% 
  merge(., dt, by = 'id')

which gives

   id clust.x time.x var.x time.y var.y clust.y
1:  2  clust1      1     a      1     a  clust1
2:  2  clust1      1     a      2     a  clust1
3:  2  clust1      1     a      3     c  clust1
4:  3  clust2      3     c      1     a  clust2
5:  3  clust2      3     c      2     b  clust2
6:  3  clust2      3     c      3     c  clust2
7:  5  clust3      1     a      1     a  clust3
8:  5  clust3      1     a      2     a  clust3
9:  5  clust3      1     a      3     c  clust3

Is there a more straightforward solution?

library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2", 
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
 3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L, 
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L, 
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names =  c("id", 
 "time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))

A common option (from the dupe I'm about to mark): `dt[ dt[, sample(.I, 1), by=clust]$V1 ]` — Frank, Oct 23 '16 at 16:27

Psidom · Accepted Answer · score 3 · 2016-10-23T18:09:27.627

Here is a variant following @Frank's comment that might help. Essentially, you sample one unique id from each clust group and use .I to find the row numbers of all rows belonging to that id, then subset by those row numbers:

dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]

#   id time var  clust
#1:  2    1   a clust1
#2:  2    2   a clust1
#3:  2    3   c clust1
#4:  3    1   a clust2
#5:  3    2   b clust2
#6:  3    3   c clust2
#7:  4    1   a clust3
#8:  4    2   b clust3
#9:  4    3   c clust3
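To spell out what the one-liner does, here is a minimal sketch that splits it into two steps. The toy table below is an assumption made up for illustration (same shape as the question's `dt`, with `var` omitted), and `set.seed()` is only there to make the draw repeatable.

```r
library(data.table)

# Toy person-period data shaped like the question's `dt`
# (6 ids, 3 periods each, 3 clusters; `var` omitted for brevity).
dt <- data.table(
  id    = factor(rep(1:6, each = 3)),
  time  = rep(1:3, times = 6),
  clust = rep(c("clust1", "clust1", "clust2", "clust3", "clust3", "clust2"),
              each = 3)
)

set.seed(1)  # only to make the draw repeatable

# Step 1: within each cluster, pick one id and collect the row numbers
# (.I refers to positions in the full table) of every row with that id.
rows <- dt[, .I[id == sample(unique(id), 1)], by = clust]

# Step 2: subset the original table by those row numbers.
res <- dt[rows$V1]
res
```

Because `.I` holds row positions in the whole table rather than within the group, `rows$V1` can be used directly to index `dt`, so every period of the sampled id is kept.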

edited Oct 23 '16 at 18:09

answered Oct 23 '16 at 17:24

Strangely, when I use it on my data I get the warning `In hldid == sample(unique(hldid), 10) : longer object length is not a multiple of shorter object length`. I wonder why. I have a lot of cases per cluster. – giac Oct 23 '16 at 18:23

`==` works if you want to sample one id per group. If you want to sample multiple ids, you need `%in%`: `dt[dt[, .I[id %in% sample(unique(id),10)], clust]$V1]` – Psidom Oct 23 '16 at 18:24

Yeah, I think this is a nice way to do it! – Frank Oct 24 '16 at 03:40
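To make the `==` versus `%in%` point concrete: `id == <vector of 10 ids>` compares element by element and recycles the shorter side, which is where the warning comes from, whereas `%in%` tests membership. Below is a minimal sketch on made-up data. The helper `sample_ids()` is hypothetical (it is not from this thread); it is added on the assumption that some clusters may hold fewer ids than requested, in which case `sample(unique(id), 10)` would stop with an error, and that ids may be numeric, in which case `sample()` treats a single number x as 1:x.

```r
library(data.table)

# Toy data: cluster "a" has 4 ids, cluster "b" has only 2; 3 periods each.
dt <- data.table(
  id    = rep(1:6, each = 3),
  time  = rep(1:3, times = 6),
  clust = rep(c("a", "a", "a", "a", "b", "b"), each = 3)
)

# Hypothetical helper (not from the thread): draw up to `n` distinct ids.
# Sampling positions with sample.int() sidesteps sample()'s special case
# for a single number, and min() avoids asking for more ids than exist.
sample_ids <- function(ids, n) {
  u <- unique(ids)
  u[sample.int(length(u), min(n, length(u)))]
}

n_per_clust <- 3
rows <- dt[, .I[id %in% sample_ids(id, n_per_clust)], by = clust]$V1
res  <- dt[rows]

# Number of ids kept per cluster: 3 for "a", 2 for "b" (all it has).
counts <- res[, uniqueN(id), by = clust]
counts
```

Whether a small cluster should contribute all of its ids or be dropped is a design choice; the cap used here is just one option.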

score 2 · Answer 2 · answered Oct 23 '16 at 17:09

I think tidy data here would have an ID table where cluster is an attribute:

idDT = unique(dt[, .(id, clust)])


   id  clust
1:  1 clust1
2:  2 clust1
3:  3 clust2
4:  4 clust3
5:  5 clust3
6:  6 clust2

From there, sample...

my_selection = idDT[, .(id = sample(id, 1)), by=clust]

and merge or subset

dt[ my_selection, on=names(my_selection) ]
# or 
dt[ id %in% my_selection$id ]

I would keep the intermediate table my_selection around, expecting it to come in handy later.
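As one illustration of why the intermediate table might come in handy, here is a minimal sketch, on made-up data, that reuses `my_selection` for both a join and an anti-join. The toy table and the names `sampled` and `held_out` are assumptions for the example, and `.SD[sample(.N, 1)]` is swapped in for `sample(id, 1)` so that the draw is over row positions rather than id values.

```r
library(data.table)

# Toy person-period data: 6 ids, 3 periods each, 3 clusters.
dt <- data.table(
  id    = rep(1:6, each = 3),
  time  = rep(1:3, times = 6),
  clust = rep(c("clust1", "clust1", "clust2", "clust3", "clust3", "clust2"),
              each = 3)
)

# One row per person, with the cluster as an attribute.
idDT <- unique(dt[, .(id, clust)])

# One randomly chosen person per cluster (sampling row positions 1..N).
my_selection <- idDT[, .SD[sample(.N, 1)], by = clust]

# Join: all periods of the sampled persons.
sampled <- dt[my_selection, on = c("id", "clust")]

# Anti-join: all periods of everyone who was not sampled.
held_out <- dt[!my_selection, on = c("id", "clust")]
```

The same `my_selection` table thus defines both the sample and its complement without redrawing.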

I like the solution, but it still bothers me to have to create several vectors. Thanks – giac Oct 23 '16 at 18:24
