I need to split a large dataframe of meteorological time series into training and validation samples. It contains data from multiple stations, which have varying periods of observation. How can I divide it so that the proportion of training to validation observations is the same for every station? Given the following dataset:
Station | Date | temp |
---|---|---|
A | 2012-01-01 | -0.8 |
A | 2012-01-02 | 0.1 |
A | 2012-01-03 | 0.5 |
A | 2012-01-04 | 0.4 |
B | 2012-01-01 | 0.1 |
B | 2012-01-02 | 0.5 |
and assuming that the training set should include only the first 50% of the observations from each station, the desired output would be:
Station | Date | temp |
---|---|---|
A | 2012-01-01 | -0.8 |
A | 2012-01-02 | 0.1 |
B | 2012-01-01 | 0.1 |
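
For reference, here is a minimal sketch of the kind of split I have in mind, assuming the data is in a pandas DataFrame with the columns shown above. The names `train_frac`, `position`, `size`, `is_train`, `train` and `valid` are only illustrative, and this is one possible approach rather than a definitive one:

```python
import pandas as pd

# Example data matching the table above
df = pd.DataFrame({
    "Station": ["A", "A", "A", "A", "B", "B"],
    "Date": pd.to_datetime([
        "2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04",
        "2012-01-01", "2012-01-02",
    ]),
    "temp": [-0.8, 0.1, 0.5, 0.4, 0.1, 0.5],
})

train_frac = 0.5  # assumed share of each station's rows used for training

# Order rows chronologically within each station
df = df.sort_values(["Station", "Date"]).reset_index(drop=True)

grouped = df.groupby("Station")
position = grouped.cumcount()              # 0-based row position within the station
size = grouped["Date"].transform("size")   # number of rows for that station

# A row goes to training if it falls in the first train_frac of its station
is_train = position < size * train_frac

train = df[is_train]
valid = df[~is_train]
print(train)
```

With an odd number of observations for a station, this rule puts the extra row in the training set; the comparison could be changed if the opposite rounding is preferred.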