R - sample and resample a person-period file

Question

I am working with a very large person-period file, and I thought a good way to handle such a large dataset would be to use sampling and resampling techniques.

My person-period file looks like this:

   id code time
1   1    a    1
2   1    a    2
3   1    a    3
4   2    b    1
5   2    c    2
6   2    b    3
7   3    c    1
8   3    c    2
9   3    c    3
10  4    c    1
11  4    a    2
12  4    c    3
13  5    a    1
14  5    c    2
15  5    a    3

I actually have two distinct issues: sampling the file, and then repeating that sampling many times.

The first issue is that I am having trouble simply sampling a person-period file.

For example, I would like to sample 2 id-sequences, such as:

  id code time
   1    a    1
   1    a    2
   1    a    3
   2    b    1
   2    c    2
   2    b    3

The following line of code works for sampling a person-period file (sampling from the unique ids, so that the same id cannot be drawn twice):

dt[which(dt$id %in% sample(unique(dt$id), 2)), ]

However, I would like to use a dplyr solution because I am interested in resampling and in particular I would like to use replicate.

I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)

I am struggling with the dplyr solution because I am not sure what should be the grouping variable.

library(dplyr)
dt %>% group_by(id) %>% sample_n(1)

gives me an incorrect result because it does not keep the full sequence of each id.

Any idea how I could both sample and resample a person-period file?

data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L, 
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b", 
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", 
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA, 
-15L), class = "data.frame")

I also updated with a replicate version in dplyr – akrun Aug 10 '16 at 17:31

Frank · Accepted Answer · 2016-08-10T16:44:54.053

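I think the idiomatic way would probably look like this:

set.seed(1)
# draw 2 distinct ids (one row per id, then sample rows)
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
# pull in every period row for the sampled ids
left_join(samp, dt, by = "id")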

  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3

This extends straightforwardly to more grouping variables and fancier sampling rules.
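The two steps above (draw ids, then join back to get every row for those ids) can be sketched as a self-contained example. This version swaps `sample_n()` for `slice_sample()`, which assumes dplyr 1.0.0 or later, and rebuilds the question's data with plain columns instead of factors. Both are choices made for illustration, not part of the original answer.

```r
library(dplyr)

# Toy person-period data mirroring the question (plain columns, not factors)
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

set.seed(1)
# Step 1: reduce to one row per id, then draw 2 of those rows
samp <- dt %>% distinct(id) %>% slice_sample(n = 2)
# Step 2: join back so each sampled id brings its full sequence
out <- left_join(samp, dt, by = "id")
```

Because `samp` holds each id at most once, the join returns exactly one copy of each sampled sequence.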

If you need to do this many times...

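nrep = 100   # number of replicates
ng   = 2     # number of ids drawn per replicate
# stack nrep copies of the id list, label each copy with r, then sample within r
samps = dt %>% select(id) %>% distinct %>% 
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
repdat = left_join(samps, dt, by = "id")

# then do stuff with it (do_stuff is a placeholder for your own function):
repdat %>% group_by(r) %>% do_stuff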
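If a list of sampled data frames is preferred, as in the `replicate(..., simplify = FALSE)` call from the question, one possible sketch wraps the sampling in a small function. The helper name `sample_sequences` is hypothetical, and the data are rebuilt with plain columns so that the block runs on its own.

```r
library(dplyr)

# Toy person-period data mirroring the question (plain columns, not factors)
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

# Hypothetical helper: draw n_id distinct ids and return all of their rows
sample_sequences <- function(data, n_id) {
  ids <- sample(unique(data$id), n_id)
  filter(data, id %in% ids)
}

set.seed(1)
# A list of 100 samples, each holding the full sequences of 2 ids
reps <- replicate(100, sample_sequences(dt, 2), simplify = FALSE)

# Optionally stack the samples, labelling each with its replicate number
stacked <- bind_rows(reps, .id = "r")
```

The stacked form has the same shape as a joined table with a replicate column, so `group_by(r)` works on it in the same way.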

edited Aug 10 '16 at 16:44

answered Aug 10 '16 at 16:26

Thanks, good point. However, I am trying to `replicate` the sampling, which works with something like `replicate(100, sample_n(dt, 2), simplify = FALSE)` but not with longer pipe sequences. – giac Aug 10 '16 at 16:28

@giacomoV Ok, updated. I think you should put that into the question itself. – Frank Aug 10 '16 at 16:45

akrun · Answer 2 · 2016-08-10T17:08:14.477

We can use filter with sample:

dt %>%
    filter(id %in% sample(unique(id), 2, replace = FALSE))

NOTE: The OP asked for a dplyr method, and this solution does use dplyr.

If we need to replicate, one option would be to use map from purrr:

library(purrr)
dt %>% 
    distinct(id) %>% 
    replicate(2, .) %>%
    map(~sample(., 2, replace=FALSE)) %>%
    map(~filter(dt, id %in% .))
#$id
#  id code time
#1  1    a    1
#2  1    a    2
#3  1    a    3
#4  4    c    1
#5  4    a    2
#6  4    c    3

#$id
#  id code time
#1  4    c    1
#2  4    a    2
#3  4    c    3
#4  5    a    1
#5  5    c    2
#6  5    a    3
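Both snippets above draw ids without replacement. If ids were instead drawn with replacement, as in a bootstrap (an assumption about the intended use, not something the question states), `id %in% ...` would keep a repeated id only once. A minimal base-R sketch that returns one copy of the sequence per draw follows. The `draw` column name is made up for illustration, and the data are rebuilt with plain columns.

```r
# Toy person-period data mirroring the question (plain columns, not factors)
dt <- data.frame(
  id   = rep(1:5, each = 3),
  code = c("a", "a", "a", "b", "c", "b", "c", "c", "c",
           "c", "a", "c", "a", "c", "a"),
  time = rep(1:3, times = 5)
)

set.seed(1)
# Draw 4 ids with replacement, so the same id may appear more than once
drawn <- sample(unique(dt$id), 4, replace = TRUE)

# Row indices of each person's sequence, keyed by id
rows_by_id <- split(seq_len(nrow(dt)), dt$id)
picked     <- rows_by_id[as.character(drawn)]

# One full sequence per draw; repeated ids contribute repeated sequences
boot <- dt[unlist(picked, use.names = FALSE), ]
# Label each copy with its draw number (works for unbalanced panels too)
boot$draw <- rep(seq_along(drawn), times = lengths(picked))
```

Grouping later by `draw` instead of `id` keeps repeated copies of the same person apart.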

score 2 · Answer 3 · answered Aug 10 '16 at 16:35

I imagine you are running simulations and may want to do the subsetting many times. In that case you may also want to try this data.table method, which uses the fast binary search on the key column:

library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)

#[[1]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  5    a    1
#5:  5    c    2
#6:  5    a    3

#[[2]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  4    c    1
#5:  4    a    2
#6:  4    c    3
