Time Series Forecasting of daily sales data

Question

Hi i am newbie to timeseries forecasting as well as R. I have collected daily sales data from my brothers new startup company to learn forecasting (from 28-03-19 to 7-7-21). I want to forecast for next 30 days can someone help me to forecast this data. I am sharing link this CSV file. I followed steps shown by someone from kaggle and tried forecasting with ARIMA, but i am getting a whole different values compared to actual values on test dataset with MAPE value of 28%. Hoping someone can guide me in solving this.

https://easyupload.io/hltskc

Here is the sample data for year 2021:

  data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202,  98, 120, 109 ,114, 195, 110, 121,  89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135,  97, 122, 170, 126 ,160, 110, 118, 108,  111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114,  97, 114, 149, 100, 111,  94,110 ,108, 100 , 92 ,104, 112, 160, 105,  98,  91,117 , 44,  60 , 36 , 50 , 51 , 54,  62,  61 , 62 , 50 , 59 , 85 , 49,  61 , 56 , 63,  39, 110 , 56 , 54 , 55,  56,  63 , 44, 115,  55,  50,  96 ,129 , 61,  59,  98 , 90 ,153,  90,  82 , 98,  79, 149,  97 , 85,  92,  78, 100 , 69, 152,  88,  76 , 91 ,145, 106,  69,  84,  72, 144 , 76, 74 , 94  ,70 , 86  ,76 ,137 , 71  ,87 , 91,  62 ,150 , 66 , 77  ,88, 135,  93 , 62 , 83, 83 , 72,  71 ,148 , 91 , 68 , 78 , 95 ,124,  69  ,78))

Hi there and welcome to SO. Please take a look at [How to Ask](https://stackoverflow.com/help/how-to-ask) for hints. It's a good start to give some data, make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and give an example of your desired output. You could edit your question and put some sample data there using `dput()` or `data.frame()`. Usally people on SO prefer not to click on external links. — Martin Gal, Jul 10 '21 at 17:56
@najeel it is customary to upvote and accept answers that solve your query. If they do not, please post clarifying details regarding your question. — hello_friend, Aug 17 '21 at 10:44

DPH · Answer 1 · 2021-08-15T23:55:49.213

A easy way out is the prophet package that facebook published. This prediction is pretty painless to solve but you have to make the quality asessment (rate the predictions).

library(prophet) # possibly have to install it
set.seed(1234) # make it reproducable
# set the column names that prophet requires (ds has to be date type)
names(df) <- c("ds", "y")
# build the model in automatic mode (there are some parameter to tweak)
mod <- prophet::prophet(df)
# build the future df (days you want to predict
fdf <- prophet::make_future_dataframe(mod, periods = 92)
# make the prediction
forec <- predict(mod, fdf)
# plot the data and forecast
plot(mod, forec)

head(forec$yhat)
[1] 121.2439 137.5267 140.1426 132.0549 134.5965 136.5455

You might want to look into aspects of analyzing your time series in terms of need for transformation (BoxCox, log, diff, etc.). Looking at your data (black dots) one could interpret, that there is an external regressor (aditional explanatory variable) due to the end of Abril and May data beeing comparetively low. This indepedent variable could be predicatable/know and thus have mayor impact on the prediction quality. Since you are using dayly data, there might be some information obscured to holidays that are not correctly represented in the modelling from prophet. A closing advice concerns the data length: depending on your prediction horizon roughly six months input can be not enough, espcially if there is some yearly pattern to be considered (a rule of thumb would be to use at least 2 years of data, better 3).

Possibly you might be interested in the forecast package and its functions, though I think they will not perform very well with dayly interval and short input data:

library(forecast)
# make time series object
tsdf <- ts(df$y, frequency = 365)
# make an auto.arima forecast model
mod2 <- forecast::auto.arima(log(tsdf))
# predict
plot(forecast::forecast(mod2, h = 92))

# make a "hip" prediction using neural net with 3 hidden neurons
mod3 <- forecast::nnetar(tsdf, P = 3, repeats = 1000)
plot(forecast::forecast(mod3, h = 92))

some things you might what to look/read/try:

data used:

df <- data.frame(date=seq(as.Date("2021-01-01"),as.Date("2021-07-07"),by="1 day"),sales=c(100,105 ,167, 106 ,112, 107, 202,  98, 120, 109 ,114, 195, 110, 121,  89, 128, 104, 194 ,107 ,127, 117, 100, 117 ,205,116, 112, 119, 129, 161, 132, 110, 114, 118, 194, 114, 113, 113 ,172, 101 ,161 ,102, 135,  97, 122, 170, 126 ,160, 110, 118, 108,  111, 163, 110, 123 ,102 ,116, 181, 119, 155, 108, 122, 169, 115, 122, 116, 168 ,115 ,101, 117, 113 ,163, 115 ,107, 106, 171 ,109, 119, 107, 101 ,166, 105, 102 ,174,108, 102, 114,  97, 114, 149, 100, 111,  94,110 ,108, 100 , 92 ,104, 112, 160, 105,  98,  91,117 , 44,  60 , 36 , 50 , 51 , 54,  62,  61 , 62 , 50 , 59 , 85 , 49,  61 , 56 , 63,  39, 110 , 56 , 54 , 55,  56,  63 , 44, 115,  55,  50,  96 ,129 , 61,  59,  98 , 90 ,153,  90,  82 , 98,  79, 149,  97 , 85,  92,  78, 100 , 69, 152,  88,  76 , 91 ,145, 106,  69,  84,  72, 144 , 76, 74 , 94  ,70 , 86  ,76 ,137 , 71  ,87 , 91,  62 ,150 , 66 , 77  ,88, 135,  93 , 62 , 83, 83 , 72,  71 ,148 , 91 , 68 , 78 , 95 ,124,  69  ,78))

score 0 · Accepted Answer · answered Aug 16 '21 at 08:09

Here's a fairly basic way of averaging (taking the median) of the outputs from multiple models and producing a forecast. Note, none of the necessary EDA, model tuning, or model validation been carried out either prior or after modelling. Additionally, I'm not certain this method of averaging the prediction intervals is sound. For a best practice approach please see here.

# Install pacakges if they are not already installed: necessary_packages => vector
necessary_packages <- c("forecast", "ggplot2")

# Create a vector containing the names of any packages needing installation:
# new_pacakges => vector
new_packages <- necessary_packages[!(necessary_packages %in%
                                       installed.packages()[, "Package"])]

# If the vector has more than 0 values, install the new pacakges
# (and their) associated dependencies:
if (length(new_packages) > 0) {
  install.packages(new_packages, dependencies = TRUE)
}

# Initialise the packages in the session: list of boolean => stdout (console)
lapply(necessary_packages, require, character.only = TRUE)

# Coerce to multiple seasonality time series object: 
y <- msts(
  df$sales,
  start = c(
    min(as.integer(strftime(df$date, "%Y"))), 
    1
  ),
  seasonal.periods = c(
    7.009615384615385, 
    30.5
  )
)

# Function to coalesce fit values with actuals: 
coalesceActualsFit <- function(fit_obj){
  
  # Coalesce actual values with fit values:
  # res => double vector
  res <- transform(
    setNames(
      data.frame(
        cbind(
          fit_obj$x, 
          fit_obj$fitted
        )[,1:2]
      ),
      c("x", "xhat")
    ),
    xhat = ifelse(is.na(xhat), x, xhat)
  )[,2,drop=TRUE]
  
  # Explicitly define returned object: 
  # double vector => Global Env()  
  return(res)
  
}

# Store the forecast period (n-days):
fcast_period <- 30

# Fit a holt winters model:
hw_fit <- HoltWinters(y)

# Remove NAs from the fitted values: 
hw_fit_vals <- coalesceActualsFit(hw_fit)

# Forecast out n-days:
hw_fcast <- forecast(hw_fit, h = fcast_period)

# Fit a neural network: 
nnetar_fit <- nnetar(y, P = 10, repeats = 50)

# Remove NAs from the fitted values: 
nnetar_fit_vals <- coalesceActualsFit(nnetar_fit)

# Forecast out n-days: 
nnetar_fcast <- forecast(nnetar_fit, h = fcast_period)

# Fit an stlf model: 
stlf_fit <- stlf(y)

# Forecast out n-days: 
stlf_fcast <- forecast(stlf_fit, h = fcast_period)

# Create an ensemble for the predictions: 
med_predict <- apply(
  cbind(hw_fit_vals, nnetar_fit_vals, stlf_fit$fitted),
  1, 
  median
)

# Create an ensemble for the forecasts: 
med_fcast <- apply(
  cbind(hw_fcast$mean, nnetar_fcast$mean, stlf_fcast$mean),
  1, 
  median
)

# Calculate the enesmble prediction intervals:
fcast_pis <- do.call(
  cbind,
  lapply(
    c("lower", "upper"),
    function(x){
      y <- data.frame(
        cbind(
          hw_fcast[[x]],
          nnetar_fcast[[x]],
          stlf_fcast[[x]]
          )
        )
      pi_80 <- apply(
        y[,endsWith(names(y), "80.")],
        1,
        median
      )
      pi_95 <- apply(
        y[,endsWith(names(y), "95.")],
        1,
        median
      )
      cbind(
        setNames(data.frame(pi_80), paste0(x, "_80")),
        setNames(data.frame(pi_95), paste0(x, "_95"))
      )
    }
  )
)

# Combine them into a single vector:
pt_fcasts <- c(
  med_predict, 
  med_fcast
)

# Create a data.frame containing the original data and 
# the point forecasts: 
fcast_df <- transform(
  data.frame(
    actuals = c(df$sales, rep(NA_real_, fcast_period)),
    pnt_fcasts = pt_fcasts,
    date = 
      as.Date(
        c(
          df$date,
          seq(
            max(df$date) + 1, 
            max(df$date) + fcast_period, 
            by = 1
          )
        ),
        "%Y-%m-%d"
      )
  ),
  pnt_fcasts = ifelse(is.na(pnt_fcasts), sales, pnt_fcasts)
)


# Adjust the forecast predition intervals to be the same
# length as the combined data: 
fcast_pis_adjusted <- transform(
  rbind(
    setNames(
      data.frame(
        do.call(
          cbind, 
          lapply(seq_len(ncol(fcast_pis)), function(i){
            pt_fcasts[1:nrow(df)]
          }
          )
        )
      ),
      names(fcast_pis)
    ),
    fcast_pis
  ),
  date = fcast_df$date
)

# Merge with the point forecast data: 
chart_data <- merge(
  fcast_df,
  fcast_pis_adjusted,
  by = "date"
)

# Chart the forecast vs actuals: 
ggplot(chart_data, aes(date)) + 
  geom_line(aes(y = actuals, colour = "Actuals")) + 
  geom_line(aes(y = lower_80, colour = "Lower 80 PI")) + 
  geom_line(aes(y = lower_95, colour = "Lower 95 PI")) +
  geom_line(aes(y = upper_80, colour = "Upper 80 PI")) +
  geom_line(aes(y = upper_95, colour = "Upper 95 PI")) +
  geom_line(aes(y = pnt_fcasts, colour = "Point Forecast")) +
  xlab("Date") +
  ylab("Sales") + 
  labs(colour = "Series")

Data:

df <- data.frame(
  date = seq(as.Date("2021-01-01"), as.Date("2021-07-07"), by = "1 day"),
  sales = c(
    100,
    105 ,
    167,
    106 ,
    112,
    107,
    202,
    98,
    120,
    109 ,
    114,
    195,
    110,
    121,
    89,
    128,
    104,
    194 ,
    107 ,
    127,
    117,
    100,
    117 ,
    205,
    116,
    112,
    119,
    129,
    161,
    132,
    110,
    114,
    118,
    194,
    114,
    113,
    113 ,
    172,
    101 ,
    161 ,
    102,
    135,
    97,
    122,
    170,
    126 ,
    160,
    110,
    118,
    108,
    111,
    163,
    110,
    123 ,
    102 ,
    116,
    181,
    119,
    155,
    108,
    122,
    169,
    115,
    122,
    116,
    168 ,
    115 ,
    101,
    117,
    113 ,
    163,
    115 ,
    107,
    106,
    171 ,
    109,
    119,
    107,
    101 ,
    166,
    105,
    102 ,
    174,
    108,
    102,
    114,
    97,
    114,
    149,
    100,
    111,
    94,
    110 ,
    108,
    100 ,
    92 ,
    104,
    112,
    160,
    105,
    98,
    91,
    117 ,
    44,
    60 ,
    36 ,
    50 ,
    51 ,
    54,
    62,
    61 ,
    62 ,
    50 ,
    59 ,
    85 ,
    49,
    61 ,
    56 ,
    63,
    39,
    110 ,
    56 ,
    54 ,
    55,
    56,
    63 ,
    44,
    115,
    55,
    50,
    96 ,
    129 ,
    61,
    59,
    98 ,
    90 ,
    153,
    90,
    82 ,
    98,
    79,
    149,
    97 ,
    85,
    92,
    78,
    100 ,
    69,
    152,
    88,
    76 ,
    91 ,
    145,
    106,
    69,
    84,
    72,
    144 ,
    76,
    74 ,
    94  ,
    70 ,
    86  ,
    76 ,
    137 ,
    71  ,
    87 ,
    91,
    62 ,
    150 ,
    66 ,
    77  ,
    88,
    135,
    93 ,
    62 ,
    83,
    83 ,
    72,
    71 ,
    148 ,
    91 ,
    68 ,
    78 ,
    95 ,
    124,
    69  ,
    78
  )
)

Time Series Forecasting of daily sales data

2 Answers2