Python de-aggregation

Question

I have a data set which is aggregated between two dates and I want to de-aggregate it daily by dividing total number with days between these dates. As a sample

StoreID Date_Start    Date_End     Total_Number_of_sales
78       12/04/2015    17/05/2015    79089
80       12/04/2015    17/05/2015    79089

The data set I want is:

StoreID Date         Number_Sales 
78         12/04/2015    79089/38(as there are 38 days in between)
78         13/04/2015    79089/38(as there are 38 days in between) 
78         14/04/2015    79089/38(as there are 38 days in between)
78         ...
78         17/05/2015    79089/38(as there are 38 days in between)

Any help would be useful. Thanks

score 1 · Accepted Answer · answered Aug 08 '18 at 12:40

I'm not sure if this is exactly what you want but you can try this (I've added another imaginary row):

import datetime as dt
df = pd.DataFrame({'date_start':['12/04/2015','17/05/2015'],
                   'date_end':['18/05/2015','10/06/2015'],
                   'sales':[79089, 1000]})

df['date_start'] = pd.to_datetime(df['date_start'], format='%d/%m/%Y')
df['date_end'] = pd.to_datetime(df['date_end'], format='%d/%m/%Y')
df['days_diff'] = (df['date_end'] - df['date_start']).dt.days


master_df = pd.DataFrame(None)
for row in df.index:
    new_df = pd.DataFrame(index=pd.date_range(start=df['date_start'].iloc[row],
                          end = df['date_end'].iloc[row],
                          freq='d'))
    new_df['number_sales'] = df['sales'].iloc[row] / df['days_diff'].iloc[row]
    master_df = pd.concat([master_df, new_df], axis=0)

First convert string dates to datetime objects (so you can calculate number of days in between ranges), then create a new index based on the date range, and divide sales. The loop sticks each row of your dataframe into an "expanded" dataframe and then concatenates them into one master dataframe.

Thank you for your answer, the problem is that I want to do it for all store ids so it needs to loop all stores, so final df should also include them ,any ideas? — emkay, Aug 08 '18 at 13:23
[Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) — Parfait, Aug 08 '18 at 17:02

chuni0r · Answer 2 · 2018-08-08T14:48:19.260

What about creating a new dataframe?

start = pd.to_datetime(df['Date_Start'].values[0], dayfirst=True)
end = pd.to_datetime(df['Date_End'].values[0], dayfirst=True)
idx = pd.DatetimeIndex(start=start, end=end, freq='D')
res = pd.DataFrame(df['Total_Number_of_sales'].values[0]/len(idx), index=idx, columns=['Number_Sales'])

yields

In[42]: res.head(5)
Out[42]: 
            Number_Sales
2015-04-12   2196.916667
2015-04-13   2196.916667
2015-04-14   2196.916667
2015-04-15   2196.916667
2015-04-16   2196.916667

If you have multiple stores (according to your comment and edit), then you could loop over all rows, calculate sales and concatenate the resulting dataframes afterwards.

df = pd.DataFrame({'Store_ID': [78, 78, 80],
    'Date_Start': ['12/04/2015', '18/05/2015', '21/06/2015'],
                   'Date_End': ['17/05/2015', '10/06/2015', '01/07/2015'],
                   'Total_Number_of_sales': [79089., 50000., 25000.]})

to_concat = []
for _, row in df.iterrows():
    start = pd.to_datetime(row['Date_Start'], dayfirst=True)
    end = pd.to_datetime(row['Date_End'], dayfirst=True)
    idx = pd.DatetimeIndex(start=start, end=end, freq='D')
    sales = [row['Total_Number_of_sales']/len(idx)] * len(idx)
    id = [row['Store_ID']] * len(idx)
    res = pd.DataFrame({'Store_ID': id, 'Number_Sales':sales}, index=idx)
    to_concat.append(res)

res = pd.concat(to_concat)

There are definitley more elegant solutions, have a look for example at this thread.

Thank you for your answer, the problem is that I want to do it for all store ids so it needs to loop all stores, any ideas? — emkay, Aug 08 '18 at 13:23
No they are different. And same store can have different start and end dates as these are aggregated values for some time and within longer period of time, store has multiple rows. — emkay, Aug 08 '18 at 13:30

score 1 · Answer 3 · answered Aug 08 '18 at 17:27

Consider building a list of data frames with the DataFrame constructor iterating through each row of main data frame. Each iteration will expand a sequence of days from Start_Date to end of range with needed sales division of total sales by difference of days:

from io import StringIO
import pandas as pd
from datetime import timedelta

txt = '''StoreID Date_Start    Date_End     Total_Number_of_sales
78       12/04/2015    17/05/2015    79089
80       12/04/2015    17/05/2015    89089'''

df = pd.read_table(StringIO(txt), sep="\s+", parse_dates=[1, 2], dayfirst=True)
df['Diff_Days'] = (df['Date_End'] - df['Date_Start']).dt.days

def calc_days_sales(row): 
    long_df = pd.DataFrame({'StoreID': row['StoreID'],
                            'Date': [row['Date_Start'] + timedelta(days=i) 
                                          for i in range(row['Diff_Days']+1)],
                            'Number_Sales': row['Total_Number_of_sales'] / row['Diff_Days']})    
    return long_df

df_list = [calc_days_sales(row) for i, row in df.iterrows()]

final_df = pd.concat(df_list).reindex(['StoreID', 'Date', 'Number_Sales'], axis='columns')

print(final_df.head(10))
#    StoreID       Date  Number_Sales
# 0       78 2015-04-12   2259.685714
# 1       78 2015-04-13   2259.685714
# 2       78 2015-04-14   2259.685714
# 3       78 2015-04-15   2259.685714
# 4       78 2015-04-16   2259.685714
# 5       78 2015-04-17   2259.685714
# 6       78 2015-04-18   2259.685714
# 7       78 2015-04-19   2259.685714
# 8       78 2015-04-20   2259.685714
# 9       78 2015-04-21   2259.685714

reindex at end not needed for Python 3.6 since data frame's input dictionary will be ordered.

Python de-aggregation

3 Answers3