Example data:

date = seq(as.Date("2019/01/01"), by = "month", length.out = 48)
productB = rep("B",48)
productA = rep("A",48)
subproducts1=rep("1",48)
subproducts2=rep("2",48)
subproductsx=rep("x",48)
subproductsy=rep("y",48)
b1 <- c(rnorm(30,5), rep(0,18))
b2 <- c(rnorm(30,5), rep(0,18))
b3 <- c(rnorm(30,5), rep(0,18))
b4 <- c(rnorm(30,5), rep(0,18))

Creating the data frame:

dfone <- data.frame("date"= rep(date,4),
         "product"= c(rep(productB,2),rep(productA,2)),
         "subproduct"= c(subproducts1,subproducts2,subproductsx,subproductsy),
         "actuals"= c(b1,b2,b3,b4))

How can I create a list of time series with a train/test split, one series per subproduct, from the above data frame? There are 192 rows and 4 subproducts, so 48 rows per subproduct, which gives 4 time series. I want 8 elements in the list, because each series is split into a train part and a test part.

Edit:

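library(magrittr)  # provides %>%
library(TSstudio)  # provides ts_split(), used below

# One data frame per subproduct: df.1, df.2, df.x, df.y
for(i in unique(dfone$subproduct)) {
    nam <- paste("df", i, sep = ".")
    assign(nam, dfone[dfone$subproduct==i,])
}

# Keep only the date and actuals columns
list_df <- list(df.1,df.2,df.x,df.y) %>%
  lapply(function(x) x[(names(x) %in% c("date", "actuals"))])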

for (i in 1:length(list_df)) {
   assign(paste0("df", i), as.data.frame(list_df[[i]]))
  }
combined_dfs <- merge(merge(merge(df1, df2, by = "date", all = TRUE),
                            df3, by = "date", all = TRUE),
                      df4, by = "date", all = TRUE)
colnames(combined_dfs) <- c("date", "actualB1", "actualB2", "actualAx", "actualAy")


# Drop the date column so that only the four value columns become time series
list_ts <- lapply(combined_dfs[-1], function(t)
    ts(t, start = c(2019, 1), end = c(2021, 6), frequency = 12)) %>%
  lapply(function(t) ts_split(t, sample.out = 0.2 * length(t)))  # train/test split per series
list_ts <- do.call("rbind", list_ts)  # one row per series, with columns train and test

The above is pretty much what I want. However, is there an easier way to do the nested merge(merge(...)) part?

chriswang123456

1 Answer

Something like this?

library(dplyr)
train_frac = 0.8
dfone_split <- dfone %>%
  mutate(set = sample(c("train", "test"), n(), replace = TRUE, 
                      prob = c(train_frac, 1 - train_frac)))
dfone_train <- dfone_split %>% filter(set == "train")
dfone_test <- dfone_split %>% filter(set == "test")

You might also take a look at the rsample package, which offers a wide variety of ways to control your splits, e.g. time-based sampling within windows.
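Note that the snippet above assigns rows to train and test at random, which ignores time order. If a chronological split per subproduct is what is wanted, a minimal base-R sketch might look like the following. It rebuilds the question's `dfone` so that it runs on its own. `split_one` is a hypothetical helper name, and using the first 80% of each series as train is an assumption rather than something the question specifies.

```r
# Rebuild the example data from the question (compact form).
set.seed(1)
date <- seq(as.Date("2019/01/01"), by = "month", length.out = 48)
dfone <- data.frame(
  date       = rep(date, 4),
  product    = rep(c("B", "B", "A", "A"), each = 48),
  subproduct = rep(c("1", "2", "x", "y"), each = 48),
  actuals    = c(replicate(4, c(rnorm(30, 5), rep(0, 18))))
)

train_frac <- 0.8  # assumed share of each series used for training

# Hypothetical helper: turn one subproduct's rows into a monthly ts
# and cut it chronologically into train and test.
split_one <- function(d) {
  d <- d[order(d$date), ]
  n_train <- floor(train_frac * nrow(d))
  first_month <- as.integer(c(format(d$date[1], "%Y"), format(d$date[1], "%m")))
  y <- ts(d$actuals, start = first_month, frequency = 12)
  list(
    train = window(y, end = time(y)[n_train]),
    test  = window(y, start = time(y)[n_train + 1])
  )
}

# One data frame per subproduct, then one train/test pair per data frame.
by_sub <- split(dfone, dfone$subproduct)

# Flatten one level: 4 subproducts x 2 parts = 8 named time series.
list_ts <- unlist(lapply(by_sub, split_one), recursive = FALSE)
names(list_ts)
```

Because `split()` works for any number of groups, nothing here has to be hard-coded per subproduct. For a single data frame, rsample's `initial_time_split()` is a packaged alternative to the manual `window()` calls.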

Jon Spring
  • Hey, I've updated my post above. The only thing I would want to change is the combined_dfs part. Is there a way to change code so I don't need to continuously add merge() if I have say 100 unique subproducts? – chriswang123456 Jun 28 '21 at 23:02
  • That seems like a different question and I don't see enough info in your post to be able to reproduce it. I have seen similar questions answered like this one: https://stackoverflow.com/questions/14096814/merging-a-lot-of-data-frames – Jon Spring Jun 28 '21 at 23:07
  • Oops, I added the function that is necessary to make that code chunk run. The problem is that it's not any better than the solution I currently have. Making things automated and less hard-coded in R is challenging – chriswang123456 Jun 28 '21 at 23:37
  • https://rsample.tidymodels.org/reference/initial_split.html – Jon Spring Jun 29 '21 at 03:00
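
On the follow-up in the comments about avoiding nested merge() calls when there are many subproducts, one possible sketch is to fold merge() over a list with base R's Reduce(). The example rebuilds the question's `dfone` so that it runs on its own. The `actual_<subproduct>` column names are an assumption made for illustration and differ from the hand-written names in the question.

```r
# Rebuild the example data from the question (compact form).
set.seed(1)
date <- seq(as.Date("2019/01/01"), by = "month", length.out = 48)
dfone <- data.frame(
  date       = rep(date, 4),
  product    = rep(c("B", "B", "A", "A"), each = 48),
  subproduct = rep(c("1", "2", "x", "y"), each = 48),
  actuals    = c(replicate(4, c(rnorm(30, 5), rep(0, 18))))
)

# One (date, actuals) data frame per subproduct, however many there are.
by_sub <- split(dfone[c("date", "actuals")], dfone$subproduct)

# Give each value column a distinct name before merging.
pieces <- Map(
  function(d, nm) setNames(d, c("date", paste0("actual_", nm))),
  by_sub, names(by_sub)
)

# Fold merge() over the list: merge(merge(merge(p1, p2), p3), p4), ...
combined_dfs <- Reduce(
  function(x, y) merge(x, y, by = "date", all = TRUE),
  pieces
)
names(combined_dfs)
```

The same call works unchanged for 100 subproducts, since the number of merges follows from the length of `pieces`.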