3

I'm trying (for the first time) to add an external variable to prophet with the add_regressor function, but the results I'm getting look wild. The dataset I'm using is freely available on kaggle (the well known shampoo sales) here. I'm attempting to use freely available data for the SPY stock index using R's quantmod package as my external variable.

Here's how I start the code:

library(prophet)
library(quantmod)
library(dplyr)


df <- read.csv("~/shampoo.csv")


#now get the min and max dates in the column
min_date <- min(df$Date, na.rm = TRUE)
max_date <- max(df$Date, na.rm = TRUE)


#download the SPY stock data
getSymbols("SPY", from  = min_date, to = max_date)

#SPY closes stored into a df and massage a bit
Close <- data.frame(Cl(SPY))
Close <- cbind(ds = rownames(Close), Close)
rownames(Close) <- NULL
Close_no_rename <- Close
colnames(Close)[colnames(Close) == 'SPY.Close'] <- 'y'
colnames(Close_no_rename)[colnames(Close_no_rename) == 'SPY.Close'] <- 'SPY_CLOSE'


#now put this into prophet and make a forecast for the forecast_period for SPY
stock_model <- prophet(Close)


#make a forecast dataframe
future_stocks <- make_future_dataframe(stock_model, periods = 30, freq = "month", include_history = FALSE)

#the below df will have predicted stock prices of the SPY. want to extract the future y values as point forecast along with dates
forecast <- predict(stock_model, future_stocks) %>% select(ds, yhat)
colnames(forecast)[colnames(forecast) == 'yhat'] <- 'SPY_CLOSE'


#rename the columns of the actual df
colnames(df)[colnames(df) == 'Date'] <- 'ds'
colnames(df)[colnames(df) == 'Value'] <- 'y'

#now want to merge the Close df y historic values onto the training df, merge by date ds column
df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")


#now actually forecast using prophet
model <- prophet()

#add the SPY regressor
model <- add_regressor(model, 'SPY_CLOSE', prior.scale = 0.0000001, standardize = FALSE)
model <- fit.prophet(model, df_historic_with_SPY_close)


forecast_final <- predict(model, forecast)

plot(model, forecast_final)

This does not throw any errors but the plot of the forecast looks...wrong. It looks as if the scale is off or something. I tried fiddling with the prior and standardize settings with no luck. Thanks for any help! enter image description here

Here is the shampoo dataset being used as the main variable:

Date    Value
2017-01-01  266
2017-02-01  145.9
2017-03-01  183.1
2017-04-01  119.3
2017-05-01  180.3
2017-06-01  168.5
2017-07-01  231.8
2017-08-01  224.5
2017-09-01  192.8
2017-10-01  122.9
2017-11-01  336.5
2017-12-01  185.9
2018-01-01  194.3
2018-02-01  149.5
2018-03-01  210.1
2018-04-01  273.3
2018-05-01  191.4
2018-06-01  287
2018-07-01  226
2018-08-01  303.6
2018-09-01  289.9
2018-10-01  421.6
2018-11-01  264.5
2018-12-01  342
2019-01-01  339.7
2019-02-01  440.4
2019-03-01  315.9
2019-04-01  439.3
2019-05-01  401.3
2019-06-01  437.4
2019-07-01  575.5
2019-08-01  407.6
2019-09-01  682
2019-10-01  475.3
2019-11-01  581.3
2019-12-01  646.9
Daniel V
  • 1,305
  • 7
  • 23
DeathbyGreen
  • 337
  • 2
  • 17
  • Could you elaborate on what exactly you mean by wrong? I am assuming the black dots are actual data, while the blue line is predicted? In that case, should the blue line be crossing zero? – Grubbmeister Dec 05 '19 at 00:19
  • Perhaps `model <- prophet(yearly.seasonality = F)` ? – cuttlefish44 Dec 05 '19 at 02:27
  • @Grubbmeister when i make a univariate forecast the massive increase in range is not present; including the data from the market seems to have dramatically decreased performance to the point it seems there was an error in the code – DeathbyGreen Dec 05 '19 at 02:58
  • @cuttlefish44 that doesnt seem to have an effect; either way the shampoo sales dataset exhibits some form of seasonality – DeathbyGreen Dec 05 '19 at 03:02
  • @DeathbyGreen the line `df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")` is not working for me. It creates a dataframe with no rows. – Grubbmeister Dec 05 '19 at 03:47
  • Also, @DeathbyGreen so everyone's on the same page, could you provide your shampoo data as a dput in the question? I had to modify the dates, so I don't know if what I have matches what you have. – Grubbmeister Dec 05 '19 at 03:52
  • added the dataset – DeathbyGreen Dec 05 '19 at 04:15

1 Answers1

1

I think I have fixed things, but the only thing I did differently was change the dates from factor format to date format, and tell R to use the select function from dplyr. I also ran R without any other packages loaded. So, it's still a bit of a mystery why this worked.

I tumbled to this issue when the line

df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")

didn't work properly. I discovered it was because I had formatted the df dates as dates to get them to work with getSymbols but then they were different from Close_no_rename.

First, the data I used:

df<-dput(df)
structure(list(ds = structure(c(17167, 17198, 17226, 17257, 17287, 
17318, 17348, 17379, 17410, 17440, 17471, 17501, 17532, 17563, 
17591, 17622, 17652, 17683, 17713, 17744, 17775, 17805, 17836, 
17866, 17897, 17928, 17956, 17987, 18017, 18048, 18078, 18109, 
18140, 18170, 18201, 18231), class = "Date"), y = c(266, 145.9, 
183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5, 
185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287, 226, 303.6, 289.9, 
421.6, 264.5, 342.3, 339.7, 440.4, 315.9, 439.3, 401.3, 437.4, 
575.5, 407.6, 682, 475.3, 581.3, 646.9)), row.names = c(NA, -36L
), class = "data.frame")


    library(prophet)
    library(quantmod)
    library(dplyr)
    
# can use your df, rather than above
    df<-read.csv("~/shampoo.csv")
# either way, should run this
    df$Date<-as.Date.factor(df$Date,tryFormats = c("%d-%m-%y"))
    str(df) #check
    
    #now get the min and max dates in the column
    min_date <- min(df$Date, na.rm = TRUE)
    max_date <- max(df$Date, na.rm = TRUE)
    
    
    #download the SPY stock data
    getSymbols("SPY", from  = min_date, to = max_date)
    
    #SPY closes stored into a df and massage a bit
    Close <- data.frame(Cl(SPY))
    Close <- cbind(ds = rownames(Close), Close)
    rownames(Close) <- NULL
    Close_no_rename <- Close
    colnames(Close)[colnames(Close) == 'SPY.Close'] <- 'y'
    colnames(Close_no_rename)[colnames(Close_no_rename) == 'SPY.Close'] <- 'SPY_CLOSE'

# make dates in date format 

    Close_no_rename$ds<-as.Date(Close_no_rename$ds)
    str(Close_no_rename)
    
    #now put this into prophet and make a forecast for the forecast_period for SPY
    stock_model <- prophet(Close)
    
    
    #make a forecast dataframe
    future_stocks <- make_future_dataframe(stock_model, periods = 30, freq = "month", include_history = FALSE)
    
    #the below df will have predicted stock prices of the SPY. want to extract the future y values as point forecast along with dates
# specify dplyr:::select
    forecast <- predict(stock_model, future_stocks) %>% dplyr:::select(ds, yhat)
    colnames(forecast)[colnames(forecast) == 'yhat'] <- 'SPY_CLOSE'
    
    
    #rename the columns of the actual df
    colnames(df)[colnames(df) == 'Date'] <- 'ds'
    colnames(df)[colnames(df) == 'Value'] <- 'y'
    
    #now want to merge the Close df y historic values onto the training df, merge by date ds column
    df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")
    df_historic_with_SPY_close
    
    #now actually forecast using prophet
    model <- prophet()
    
    #add the SPY regressor
    model <- add_regressor(model, 'SPY_CLOSE', prior.scale = 0.0000001, standardize = FALSE)
    model <- fit.prophet(model, df_historic_with_SPY_close)
    
    
    forecast_final <- predict(model, forecast)
    
    plot(model, forecast_final)

Result:

enter image description here

EDIT

Using the following dataset, with dates changed to one that is close in the SPY dataset:

df<-dput(df)
structure(list(ds = structure(c(17169, 17198, 17226, 17259, 17287, 
17318, 17350, 17379, 17410, 17442, 17471, 17501, 17534, 17563, 
17591, 17624, 17652, 17683, 17715, 17744, 17778, 17805, 17836, 
17868, 17898, 17928, 17956, 17987, 18017, 18050, 18078, 18109, 
18142, 18170, 18201, 18232), class = "Date"), y = c(266, 145.9, 
183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5, 
185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287, 226, 303.6, 289.9, 
421.6, 264.5, 342.3, 339.7, 440.4, 315.9, 439.3, 401.3, 437.4, 
575.5, 407.6, 682, 475.3, 581.3, 646.9)), class = "data.frame", row.names = c(NA, 
-36L))

We get this, which looks much better:

enter image description here

EDIT 2

The problem is to do with missing data. Some dates in the shampoo dataset are not in the SPY dataset. The following will select the data from the nearest date in the SPY dataset to overcome the problem of missing data. However, the graph it generates still looks odd, and changing the dates a little seems to be cause of the problem.

Replacing the line:

df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")

With (credit to https://stackoverflow.com/a/28073823/7967291):

library(data.table)
setDT(Close_no_rename)
setDT(df)

setkey(Close_no_rename, ds)[, dateMatch:=ds]
df_historic_with_SPY_close<-Close_no_rename[df, roll='nearest']
df_historic_with_SPY_close<-setDT(df_historic_with_SPY_close)[,-1]
names(df_historic_with_SPY_close)[names(df_historic_with_SPY_close) == "dateMatch"] <- "ds"
df_historic_with_SPY_close
df_historic_with_SPY_close <- mutate (  df_historic_with_SPY_close,  ds = ymd(ds) )
str(df_historic_with_SPY_close)
Community
  • 1
  • 1
Grubbmeister
  • 857
  • 8
  • 25
  • okay great; I see the bounty expires in 14 hours. I'm at work now but I'll try this solution when I get home later today (around 8 hours from now) – DeathbyGreen Dec 05 '19 at 14:25
  • @DeathbyGreen ok, let me know how it goes – Grubbmeister Dec 05 '19 at 23:12
  • tried your code, it looks like it helped for sure. the numbers its predicting are still strange looking but maybe thats just because stocks arent a good predictor. I plotted only the stock forecast and the stock forecast looks reasonable. but adding it as a regressor makes the swings for the shampoo predictions look crazy no matter how low i set the prior. – DeathbyGreen Dec 05 '19 at 23:30
  • Yes, I think the dates were not matched in format. I don't know why it wouldn't be a good predictor, since `summary(lm(y~ SPY_CLOSE,data=df_historic_with_SPY_close))` gives a signifcant relationship with R2 of 0.62, which you'd expect to be better than the above graph – Grubbmeister Dec 06 '19 at 00:31
  • @DeathbyGreen You do know that the SPY dataset is missing observations for thirteen months? – Grubbmeister Dec 06 '19 at 00:34
  • what 7 months? tail and head commands show it starting in 2017-01 and ending at present – DeathbyGreen Dec 06 '19 at 00:39
  • 13 actually, edited. In `df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")` the dataset is only 23 lines long instead of 36 - take a look. When you look at the SPY data set, you see that those first days of the month are sometimes missing. – Grubbmeister Dec 06 '19 at 00:43
  • I think the issue is the merge by date bc the shampoo set is always the first of the month and the SPY doesnt always have data, like on weekends. So I think some logic needs to be in place to interpolate the SPY data or something to make sure there is a value on every day – DeathbyGreen Dec 06 '19 at 00:53
  • @DeathbyGreen see edits for a solution to getting the nearest date. However, there are still problems with the graph which are currently beyond me. – Grubbmeister Dec 06 '19 at 01:55
  • @DeathbyGreen did you ever figure this out fully? I am running into the exact same issue with stock data with an extra regressor and my predictions are really fluctuating and making it look like an unrealistic prediction. I don't think the date matching is an issue either. Just wondering if you have maybe solved it a different way outside of date. Thanks in advance. – nak5120 Jul 14 '20 at 03:37
  • I posted a similar question here if you have found a solution to this already: https://stackoverflow.com/questions/62971261/prophet-prediction-with-regressors-has-outlier-results – nak5120 Jul 18 '20 at 17:02