
I have a dataset of Total Sales from 2008-2015. I have an entry for each day, so I have created a pandas DataFrame with a DatetimeIndex and a column for sales. It looks like this:

[image: the sales DataFrame]

The problem is that I am missing data for most of 2010. These missing values are currently represented by 0.0, so if I plot the DataFrame I get:

[image: plot of the sales DataFrame, with the missing 2010 values shown as 0.0]
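One way to keep the placeholder zeros from being read as real sales is to mark them as NaN instead. This is only a minimal sketch on synthetic data: the column name `sales`, the constant values, and the exact range of missing days are assumptions for illustration, not the real dataset. Marking a value as NaN is not the same as filling it in; it only records that the observation is unknown.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): daily index 2008-2015,
# one "sales" column, with placeholder zeros for most of 2010.
idx = pd.date_range("2008-01-01", "2015-12-31", freq="D")
sales = pd.DataFrame({"sales": np.full(len(idx), 100.0)}, index=idx)
sales.loc["2010-02-01":"2010-12-31", "sales"] = 0.0

# Treat 0.0 inside 2010 as "missing" rather than "no sales".
in_2010 = sales.index.year == 2010
is_zero = sales["sales"] == 0.0
sales.loc[in_2010 & is_zero, "sales"] = np.nan

n_missing = int(sales["sales"].isna().sum())
mean_sales = sales["sales"].mean()  # NaN values are skipped by default
```

With NaN in place, pandas summary statistics such as `mean()` skip the missing days, so they are no longer dragged down by the zeros.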

I want to try to forecast values for 2016, possibly using an ARIMA model, so the first step I took was to perform a decomposition of this time series:

[image: decomposition of the sales time series]

Obviously if I leave 2010 in the DataFrame any attempted prediction will be skewed by the apparent, albeit erroneous, drop in sales.

What is the recommended approach in this situation? I think I should just drop 2010 altogether, but then I don't know if my time series is valid going from 2009 to 2011. I don't want to fill the missing values, because I don't believe I can do so accurately.

If I just delete 2010, however, the plot 'fills in' 2010, which doesn't help me:

sales = sales.drop(sales.loc['2010'].index)

[image: plot after dropping 2010, with a line drawn across the gap]
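The line is drawn across 2010 because, once the rows are dropped, the last day of 2009 and the first day of 2011 are adjacent points. A possible way to get a visible gap instead (a sketch on synthetic data, with the column name `sales` assumed) is to put the series back on a regular daily frequency with `asfreq`, which reinserts the dropped days as NaN rows. Line plots normally leave a break at NaN values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption).
idx = pd.date_range("2008-01-01", "2015-12-31", freq="D")
sales = pd.DataFrame({"sales": np.arange(len(idx), dtype=float)}, index=idx)

# Same effect as dropping every row that falls in 2010.
dropped = sales[sales.index.year != 2010]

# Restore a regular daily index; the days that were dropped come back as NaN.
regular = dropped.asfreq("D")
```

Plotting `regular` rather than `dropped` should then show a hole over 2010 rather than a connecting line.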

Philip O'Brien
  • Your question is a better fit at CrossValidated (http://stats.stackexchange.com), though searching for "time series missing data" there might give you the answer. – Alicia Garcia-Raboso Jun 30 '16 at 12:02
  • I think dropping 2010 because it is incomplete is a good idea. You then cannot plot or forecast against the timestamp/date, as your last figure shows. You can, however, easily plot or forecast using the positional index of each entry and label the axis with the associated dates. The dates stay visible, yet no hole results from dropping 2010. – Nikolas Rieble Jul 05 '16 at 13:07
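
The suggestion in the last comment could be sketched roughly as follows. The data is synthetic and the names `positions`, `first_of_year` and `tick_labels` are made up for illustration. The idea is to use 0, 1, 2, ... as the x-axis, so that 2009 runs straight into 2011, and to place one date label at the first remaining day of each year.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption).
idx = pd.date_range("2008-01-01", "2015-12-31", freq="D")
sales = pd.DataFrame({"sales": np.ones(len(idx))}, index=idx)
dropped = sales[sales.index.year != 2010]

# Positional x-axis: consecutive integers, so there is no hole for 2010.
positions = np.arange(len(dropped))

# One tick at the first remaining day of each year, labelled with that year.
years = np.asarray(dropped.index.year)
first_of_year = np.flatnonzero(np.r_[True, years[1:] != years[:-1]])
tick_labels = [str(y) for y in years[first_of_year]]
```

With matplotlib, `positions` and `dropped["sales"]` would go to `ax.plot`, and `first_of_year` and `tick_labels` to `ax.set_xticks` and `ax.set_xticklabels`. Note that this hides the gap visually; whether a model fitted on the concatenated series is valid is a separate question.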
