
I have a DataFrame of sales numbers with a DatetimeIndex, covering a couple of years at minute resolution, and I want to first calculate totals (of sales) per year, month, day, hour and location, then average those totals over years and months.

Then with that data, I want to extrapolate to a new month, per day, hour and location. To do that, I calculate the sales numbers per hour for each day of the week (expecting that weekend days will behave differently from work-week days), then I create a new DataFrame for the month I want to extrapolate to. For each day in that month, I compute (day of week, hour, POS) and use the past average for the corresponding (day of week, hour, POS) as my "prediction" for what will be sold at that POS at the given hour and day in the given month.

The reason I'm doing it this way is that once I calculate a mean per day of the week in the past, when I populate the DataFrame for the month of June, the 1st of June could be any day of the week, and that is important as weekdays/weekend days behave differently. I want the past sales number for a Friday, if the 1st is a Friday.
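As a tiny illustration of that point, the day of the week is directly available from the timestamp, so the 1st of a month can be mapped to its actual weekday:

```python
import pandas as pd

# 2020-06-01 is a Monday, so its row should be filled with
# the historical Monday averages, not a fixed weekday
first = pd.Timestamp('2020-06-01')
print(first.day_name(), first.dayofweek)  # Monday 0
```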

I have the following, which is unfortunately too slow - or maybe wrong; in any case, there is no error message but it never completes on the real data:

import numpy as np
import pandas as pd

# Setup some sales data for the past 2 years for some stores
hours = pd.date_range('2018-01-01', '2019-12-31', freq='h')
sales = pd.DataFrame(index=hours, columns=['Store', 'Count'])
sales['Store'] = np.random.randint(0, 10, sales.shape[0])
sales['Count'] = np.random.randint(0, 100, sales.shape[0])

# Calculate the average of sales over these 2 years for each hour in 
# each day of the week and each store
avg_sales = sales.groupby([sales.index.year, sales.index.month, sales.index.dayofweek, sales.index.hour, 'Store'])['Count'] \
        .sum() \
        .rename_axis(index=['Year', 'Month', 'DayOfWeek', 'Hour', 'Store']) \
        .reset_index() \
        .groupby(['DayOfWeek', 'Hour', 'Store'])['Count'] \
        .mean()

# Setup a DataFrame to predict May sales per store/day/hour
may_hours = pd.date_range('2020-05-01', '2020-05-31 23:00', freq='h')
predicted = pd.DataFrame(index=pd.MultiIndex.from_product([may_hours, range(0, 10)]), columns=['Count']) \
            .rename_axis(index=['Datetime', 'Store'])

# "Predict" sales for each (day, hour, store) in May 2020
# by retrieving the average sales for the corresponding
# (day of week, hour, store)
for idx in predicted.index:
    qidx = (idx[0].dayofweek, idx[0].hour, idx[1])
    predicted.loc[idx] = avg_sales.loc[qidx] if qidx in avg_sales.index else 0
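For reference, here is a self-contained merge-based sketch of the same pipeline (untested on the real data) that avoids the per-row loop entirely by materialising the day of week and hour as ordinary columns and joining on them:

```python
import numpy as np
import pandas as pd

# Same synthetic shape as above: hourly sales for 10 stores over 2 years
hours = pd.date_range('2018-01-01', '2019-12-31', freq='h')
sales = pd.DataFrame({'Store': np.random.randint(0, 10, len(hours)),
                      'Count': np.random.randint(0, 100, len(hours))},
                     index=hours)

# Monthly totals per (year, month, day of week, hour, store),
# then the mean of those totals per (day of week, hour, store)
avg_sales = (sales.groupby([sales.index.year, sales.index.month,
                            sales.index.dayofweek, sales.index.hour, 'Store'])['Count']
                  .sum()
                  .rename_axis(index=['Year', 'Month', 'DayOfWeek', 'Hour', 'Store'])
                  .groupby(level=['DayOfWeek', 'Hour', 'Store'])
                  .mean())

# Target frame: every hour in May 2020 x every store, with the
# join keys as ordinary columns instead of a per-row lookup
may = (pd.MultiIndex.from_product([pd.date_range('2020-05-01', '2020-05-31 23:00', freq='h'),
                                   range(10)],
                                  names=['Datetime', 'Store'])
         .to_frame(index=False))
may['DayOfWeek'] = may['Datetime'].dt.dayofweek
may['Hour'] = may['Datetime'].dt.hour

# One left merge replaces the whole Python loop; combinations with
# no historical data become 0, as in the loop's else branch
predicted = (may.merge(avg_sales.reset_index(),
                       on=['DayOfWeek', 'Hour', 'Store'], how='left')
                .fillna({'Count': 0})
                .set_index(['Datetime', 'Store'])[['Count']])
```

The merge performs one vectorized join over all 7,440 (hour, store) rows instead of 7,440 individual `.loc` lookups, which is where the original version spends its time.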
  • In this case it would be _really_ helpful to see some sample input and desired output, to make a [mcve]. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jan 10 '20 at 17:57
  • I can try but I have to anonymize real company data :-/ The above is already obfuscated to some extent, while keeping the structure of what needs to happen. – Frank Jan 10 '20 at 18:01
  • Just some dummy data would be sufficient. In this case, it's not clear why you're taking your data out of a dataframe into a dict, then using the dict to put it back into a new dataframe. I feel like there must be some use case for `pd.merge()` in there, but it's hard to say without more clarity – G. Anderson Jan 10 '20 at 18:04
  • 1
    @G.Anderson - that is simple - the reason I'm pulling the data into a dict is only that I don't know better. That's kind of the core of the question: there must be a better way to do this than pulling the data into a dict. – Frank Jan 10 '20 at 18:21
  • I also added more details about the day of week issue. Basically I want to fill in the sales for June 1st, 2nd... per hour and POS, but I want the sales numbers for Tuesday, Wednesday, Thursday if 6/1, 6/2, 6/3 ... are indeed those days of the week. – Frank Jan 10 '20 at 18:26
  • 1
    @anky_91 - will do. – Frank Jan 10 '20 at 19:02
  • @G.Anderson Added more detailed code. – Frank Jan 10 '20 at 21:53

0 Answers