
My dataset contains many IDs with at least 100 observations per ID, with one observation per date. There is overlap in dates among IDs. See fake dataset below with 10 IDs:

id = 10                                   # number of IDs
m  = 2 * id                               # two dates (start, end) per ID
# candidate dates spanning 2001
a_0 = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
# draw m dates, sort them, and pair them up as start/end columns
a_1 = matrix(sort(sample(as.character(a_0), m)), ncol = 2)
a_2 = list()
for (i in 1:nrow(a_1)) {
  # one observation per day between the start and end date of ID i
  a_3 = seq(as.Date(a_1[i, 1]), as.Date(a_1[i, 2]), by = "day")
  # random value in [0, 1] rounded to 1 decimal
  # (runif(n, 1) would set min = 1 and always return 1)
  a_4 = data.frame(i, as.character(a_3), round(runif(length(a_3)), 1))
  colnames(a_4) = c("id", "date", "value")
  a_2[[i]] = a_4
}
DF = dplyr::bind_rows(a_2)
dim(DF)
table(DF[, 1])

For each ID, I would like to randomly sample consecutive observations over a fixed number of days, similar to what has been asked here: Sample n consecutive dates from a random starting date for each index in a data frame. So, something like this (e.g., with 10 consecutive days):

library(dplyr)

df.sample <- arrange(DF, date) %>% 
 group_by(id) %>% 
 mutate(date = as.Date(date), start = sample(date, 1)) %>% 
 filter(date >= start & date <= (start + 9))

However, I need to randomly sample different time periods for each ID: 2 x 10 days and 1 x 25 days. Also, the time periods sampled cannot overlap with each other within an ID; i.e., the same date cannot be sampled twice for the same ID.

On top of that, the first and last observation of each ID should not be sampled. Finally, there should always be at least 1 observation between the time periods sampled.

I struggle to find a simple solution that would include all these constraints. Some help would be much appreciated.

Don-Jean
  • *On top of that, the first and last observation of each ID should not be sampled.* - that part is easy. Drop the first and last observation for each ID before any sampling. – Gregor Thomas Sep 29 '20 at 18:47
  • After that, perhaps you could do something like 1) randomly determine the order of samples (i.e., 10,10,25; 10,25,10; or 25,10,10), 2) calculate the number of non-sampled and non-interim days `non_sampled = n() - 45 - 2`, 3) partition `non_sampled` days into 4 parts for before, after, and between the sample days (maybe use the `partitions` package, [somewhat like this](https://stackoverflow.com/q/31386328/903061)), 4) calculate your sampled windows. – Gregor Thomas Sep 29 '20 at 19:00
  • (Can't really tell if the `partitions` package offers this functionality... you might need to find a different package/method to randomly partition an integer) – Gregor Thomas Sep 29 '20 at 19:04
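The strategy from the comments can be prototyped without the `partitions` package. This is a sketch, not a full answer: `pick_windows` is a made-up helper, and the `sample`/`tabulate` trick stands in for a proper random integer partition (each leftover row is assigned to one of the `k + 1` gaps independently, so the split is multinomial rather than uniform over partitions). It assumes every ID has at least 49 rows (45 sampled + 2 mandatory gaps + the excluded first and last rows).

```r
library(dplyr)

# Pick k non-overlapping windows of the given lengths among rows 1..n,
# never touching row 1 or row n, with >= 1 unsampled row between windows.
pick_windows <- function(n, lens = c(10, 10, 25)) {
  lens <- sample(lens)                      # 1) randomize window order
  k <- length(lens)
  # usable rows are 2..(n - 1); reserve k - 1 mandatory gap rows
  slack <- (n - 2) - sum(lens) - (k - 1)
  stopifnot(slack >= 0)
  # 3) scatter the leftover rows over the k + 1 gaps at random
  extra <- tabulate(sample(k + 1, slack, replace = TRUE), nbins = k + 1)
  between <- 1 + extra[2:k]                 # gaps between windows, >= 1
  starts <- 2 + extra[1] + cumsum(c(0, lens[-k] + between))
  data.frame(start = starts, end = starts + lens - 1)
}

# Demo on a toy frame with 60 rows per ID (the question's DF would be
# used the same way after mutate(date = as.Date(date)) and arrange):
toy <- data.frame(id = rep(1:2, each = 60), day = rep(1:60, 2))
sampled <- toy %>%
  group_by(id) %>%
  group_modify(function(.x, .y) {
    w <- pick_windows(nrow(.x))
    .x[unlist(Map(seq, w$start, w$end)), ]
  }) %>%
  ungroup()
```

Each ID contributes 10 + 10 + 25 = 45 rows, no date is sampled twice within an ID, and the first and last rows of each ID are never selected.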

1 Answer


Here is some dummy data which contains the sample size for each id.

# Get 1 sample from the first group, 5 from the second etc.
library(tibble)  # for tribble() (also re-exported by dplyr)

sample_sizes <- tribble(
  ~id, ~sample_size,
  1,            1,
  2,            5,
  3,            2,
  4,            3,
  5,            3,
  6,            3,
  7,            3,
  8,            3,
  9,            3,
  10,           3
)

Group the data frame and calculate the start and end rows for the samples.

`sample.int(n() - sample_size - 1, 1) + 1` gives a number between 2 and `n() - sample_size`. This is the row of the first included observation, so the window can never contain the first or last observation.
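As a quick sanity check of that arithmetic (a throwaway snippet, not part of the answer's pipeline), with hypothetical values `n = 10` and `sample_size = 3`:

```r
# Every draw of sample.int(n - sample_size - 1, 1) + 1 lands in 2..7,
# so a 3-row window can never include row 1 or row 10.
n <- 10
sample_size <- 3
starts <- replicate(1000, sample.int(n - sample_size - 1, 1) + 1)
stopifnot(all(starts >= 2), all(starts + sample_size - 1 <= n - 1))
```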

DF %>%
  inner_join(sample_sizes, by = "id") %>%
  group_by(id) %>%
  mutate(
    # first sampled row, somewhere in 2..(n() - sample_size)
    start_row = sample.int(n() - sample_size - 1, 1) + 1,
    end_row = start_row + sample_size
  ) %>%
  filter(
    row_number() >= start_row & row_number() < end_row
  )
#> # A tibble: 29 x 6
#> # Groups:   id [10]
#>       id date       value sample_size start_row end_row
#>    <dbl> <chr>      <dbl>       <dbl>     <dbl>   <dbl>
#>  1     1 2001-02-25     1           1        24      25
#>  2     2 2001-02-06     1           5         3       8
#>  3     2 2001-02-07     1           5         3       8
#>  4     2 2001-02-08     1           5         3       8
#>  5     2 2001-02-09     1           5         3       8
#>  6     2 2001-02-10     1           5         3       8
#>  7     3 2001-03-30     1           2        37      39
#>  8     3 2001-03-31     1           2        37      39
#>  9     4 2001-05-21     1           3        70      73
#> 10     4 2001-05-22     1           3        70      73
#> # ... with 19 more rows
Paul