Data partitioning based on timeseries and groups in R

Question

I need to split a large dataframe of meteorological timeseries into training and validation samples. It contains data from multiple stations, which have varying periods of observation. How can I divide it so that the proportion of training and validation observations is the same for each station? Given the following dataset:

Station	Date	temp
A	2012-01-01	-0.8
A	2012-01-02	0.1
A	2012-01-03	0.5
A	2012-01-04	0.4
B	2012-01-01	0.1
B	2012-01-02	0.5

and assuming that the training set should include only the first 50% of the observations from each station, the desired output would be:

Station	Date	temp
A	2012-01-01	-0.8
A	2012-01-02	0.1
B	2012-01-01	0.1

Please do not post photos of data or code! If you do, people who are willing to help you would have to type out all that text. Instead, provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). P.S. Here is [a good overview on how to ask a good question](https://stackoverflow.com/help/how-to-ask) — dario, Sep 28 '21 at 15:22
Does this answer your question? [Stratified random sampling from data frame](https://stackoverflow.com/questions/23479512/stratified-random-sampling-from-data-frame) — dario, Sep 28 '21 at 15:23
@dario, thanks for the link, but it uses random/stratified partitioning, whereas my question asks for observations to be extracted as continuous sub-periods. Re your first comment, it's not a photo of code, but I do accept that a reproducible example is more appropriate. Sorry, I'm a newbie — tabumis, Sep 28 '21 at 15:34

score 1 · Accepted Answer · answered Sep 28 '21 at 16:12

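Given your example, you could use slice_head() from dplyr. It takes the first rows of each group in their current order, so the data must already be sorted by date within each station. To create the validation set, remove the records that are in the training set. This avoids selecting duplicates when a station has an odd number of records.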

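library(dplyr)

# Training: first 50% of rows per station; id tags each original row
training <- df1 %>% 
  mutate(Date = as.Date(Date),
         id = row_number()) %>% 
  group_by(Station) %>% 
  slice_head(prop = 0.5)

# Validation: every row whose id is not in the training set
validation <- df1 %>% 
  mutate(Date = as.Date(Date),
         id = row_number()) %>%
  filter(!id %in% training$id)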

training
# A tibble: 3 x 4
# Groups:   Station [2]
  Station Date        temp    id
  <chr>   <date>     <dbl> <int>
1 A       2012-01-01  -0.8     1
2 A       2012-01-02   0.1     2
3 B       2012-01-01   0.1     5

validation
  Station       Date temp id
1       A 2012-01-03  0.5  3
2       A 2012-01-04  0.4  4
3       B 2012-01-02  0.5  6
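
As a self-contained variant of the same idea, here is a minimal sketch that rebuilds the example data from the question and labels every row in a single pass instead of filtering by id. The names train_prop, split_df and the set column are hypothetical, introduced only for illustration, and the explicit arrange() is an added assumption that "first" should mean "earliest date" even if the input is not already sorted.

```r
library(dplyr)

# Example data from the question, rebuilt as a data frame
df1 <- data.frame(
  Station = c("A", "A", "A", "A", "B", "B"),
  Date    = c("2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
              "2012-01-01", "2012-01-02"),
  temp    = c(-0.8, 0.1, 0.5, 0.4, 0.1, 0.5)
)

# Hypothetical parameter: share of each station's record used for training
train_prop <- 0.5

split_df <- df1 %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Station, Date) %>%   # sort so the earliest dates come first
  group_by(Station) %>%
  # Within each station, label the first floor(n * train_prop) rows as training
  mutate(set = if_else(row_number() <= floor(n() * train_prop),
                       "training", "validation")) %>%
  ungroup()

training   <- filter(split_df, set == "training")
validation <- filter(split_df, set == "validation")
```

Because every row receives exactly one label, the two sets cannot overlap, and a station with an odd number of records sends its extra row to validation.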

This works well, many thanks @phiver! — tabumis, Sep 28 '21 at 16:42
