I have the following dataset:
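library(dplyr)

# Location: 9 rows of "a", 18 of "b", 27 of "c"
df <- data.frame(Location = c(rep("a", times = 9), rep("b", times = 18), rep("c", times = 27)))
# Years cycle three times within each location
df$Year <- c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
# Predictor is the row index; the data is never grouped, so no ungroup() is needed
df <- df %>%
  mutate(Predictor = seq_along(Location))
print(df)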
Location Year Predictor
a 1 1
a 2 2
a 3 3
a 1 4
a 2 5
a 3 6
a 1 7
a 2 8
a 3 9
b 1 10
b 2 11
b 3 12
b 4 13
b 5 14
... 40 more rows
I want to split this data frame into training and test sets. For the test set, I want to randomly sample a third of the distinct years within each Location, while keeping all rows for a sampled year together. For example, if Year 1 is selected for Location "a", all three rows with Year 1 should go into the test set, and so on. My test set should look something like this:
Location Year Predictor
a 1 1
a 1 4
a 1 7
b 3 12
b 3 18
b 3 24
b 5 14
b 5 20
b 5 26
c 3 30
c 3 39
c 3 48
c 6 33
c 6 42
c 6 51
c 7 34
c 7 43
c 7 52
I found a similar question here, but that procedure samples the same years, and the same number of years, from every location (and YEAR is numeric, not a factor). I want a different random sample of years for each location, with the number of sampled years proportional to the number of years that location has.
I would like to do this in dplyr if possible.
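To make the requirement concrete, here is a minimal sketch of one way this could be done in dplyr. It is not a definitive solution: the names test_years, test and train and the seed value are illustrative assumptions, and it assumes the number of distinct years per location divides reasonably by three (round() is used otherwise). The idea is to sample at the level of distinct Location/Year pairs and then pull the matching rows back with filtering joins.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  Location = rep(c("a", "b", "c"), times = c(9, 18, 27)),
  Year = c(rep(1:3, times = 3), rep(1:6, times = 3), rep(1:9, times = 3))
) %>%
  mutate(Predictor = seq_along(Location))

set.seed(42)  # arbitrary seed (assumption), only so the split is reproducible

# One row per Location/Year pair, then keep a random third of the
# years within each Location. Sampling row indices with sample(n(), ...)
# avoids sample()'s special handling of a length-one numeric vector.
test_years <- df %>%
  distinct(Location, Year) %>%
  group_by(Location) %>%
  slice(sample(n(), size = round(n() / 3))) %>%
  ungroup()

# All rows whose Location/Year pair was sampled go to the test set;
# everything else goes to the training set.
test  <- semi_join(df, test_years, by = c("Location", "Year"))
train <- anti_join(df, test_years, by = c("Location", "Year"))
```

Because the sampling happens on distinct Location/Year pairs rather than on rows, every sampled year brings all three of its rows along, and each location gets its own independent draw (1 year for "a", 2 for "b", 3 for "c" in this data).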