I have a DataFrame of sales numbers with a DatetimeIndex, for data that extends over a couple of years at the minute level, and I want to first calculate totals (of sales) per year, month, day, hour and location, then average over years and month.
Then with that date, I want to extrapolate to a new month, per day, hour and location. So to do that, I calculate the sales numbers per hour for each day of the week (expecting that weekend days will behave differently from work week days), then I create a new DataFrame for the month I want to extrapolate to, then for each day in that month, I calculate (day of week, hour, POS) and use the past data for the corresponding (day of week, hour, POS) as my "prediction" for what will be sold at POS at the given hour and day in the given month.
The reason I'm doing it this way is that once I calculate a mean per day of the week in the past, when I populate the DataFrame for the month of June, the 1st of June could be any day of the week, and that is important as weekdays/weekend days behave differently. I want the past sales number for a Friday, if the 1st is a Friday.
I have the following, that is unfortunately too slow - or maybe wrong, in any case, there is no error message but it doesn't complete (on the real data):
import numpy as np
import pandas as pd
# Setup some sales data for the past 2 years for some stores
hours = pd.date_range('2018-01-01', '2019-12-31', freq='h')
sales = pd.DataFrame(index = hours, columns=['Store', 'Count'])
sales['Store'] = np.random.randint(0,10, sales.shape[0])
sales['Count'] = np.random.randint(0,100, sales.shape[0])
# Calculate the average of sales over these 2 years for each hour in
# each day of the week and each store
sales.groupby([sales.index.year, sales.index.month, sales.index.dayofweek, sales.index.hour, 'Store'])['Count'] \
.sum() \
.rename_axis(index=['Year', 'Month', 'DayOfWeek', 'Hour', 'Store']) \
.reset_index() \
.groupby(['DayOfWeek', 'Hour', 'Store'])['Count'] \
.mean() \
.rename_axis(index=['DayOfWeek', 'Hour', 'Store'])
# Setup a DataFrame to predict May sales per store/day/hour
may_hours = pd.date_range('2020-05-01', '2020-05-31', freq='h')
predicted = pd.DataFrame(index = pd.MultiIndex.from_product([may_hours, range(0,11)]), columns = ['Count']) \
.rename_axis(index=['Datetime', 'Store'])
# "Predict" sales for each (day, hour, store) in May 2020
# by retrieving the average sales for the corresponding
# (day of week, hour store)
for idx in predicted.index:
qidx = (idx[0].dayofweek, idx[0].hour, idx[1])
predicted.loc[idx] = sales[qidx] if qidx in sales.index else 0