How to group by months from starting date, ending date and categorical variables

Question

there is a list of HR:

  Department       Start         End  Salary per month
0      Sales  01.01.2020  30.04.2020              1000
1     People  01.05.2020  30.07.2022              3000
2  Marketing  01.02.2020  30.12.2099              3200
3      Sales  01.03.2020  01.08.2023              1200
4   Engineer  01.04.2020  30.12.2099              3500

Using python I'm wondering how to group total salary by month (starting from very beginning of the first employee) and also by the department

I have tried to group salaries by date range but don't know how to also group with the department and show each department in each column

month_starts = pd.date_range(data.start.min(),data.end.max(),freq = 'MS').to_numpy()
contained = np.logical_and(
    np.greater_equal.outer(month_starts, data['start'].to_numpy()),
    np.less.outer(month_starts,data['end'].to_numpy())
)
masked = np.where(contained, np.broadcast_to(data[['salary_per_month']].transpose(), contained.shape),np.nan)
df = pd.DataFrame(masked, index = month_starts).agg('sum',axis=1).to_frame().reset_index()
df.columns = ['month', 'total_cost']

Expected output:

enter image description here

To group by month, use `df.groupby([pd.to_datetime(df['start']).dt.to_period('M'), 'Department']).sum()` (for the Year-Month), or `df.groupby([pd.to_datetime(df['start']).dt.month, 'Department']).sum()` for the months independently of the year (1-12). — mozway, Apr 12 '23 at 09:39
Do you want to explode the start/end into several months? Please provide a clear reproducible example (no image!) and the matching expected output — mozway, Apr 12 '23 at 09:47
Correct, I want to see to cost for each department for each month. But there is only list of employees with the contract starting and ending date with salary for each month. — nguyen anh, Apr 12 '23 at 09:54
I see, I updated your question with a reproducible input and provided an answer — mozway, Apr 12 '23 at 09:57

mozway · Answer 1 · 2023-04-12T10:05:37.267

You can repeat your rows and increment the periods, then pivot_table :

s = pd.to_datetime(df['Start'], dayfirst=True).dt.to_period('M')
e = pd.to_datetime(df['End'], dayfirst=True).dt.to_period('M')

n = (e-s).apply(lambda x: x.n)

(df.assign(Date=s)
   .loc[df.index.repeat(n+1)]
   .assign(Date=lambda d: d['Date']+d.groupby(level=0).cumcount())
   .pivot_table(index='Date', columns='Department', values='Salary per month',
                aggfunc='sum', fill_value=0)
)

Output:

Department  Engineer  Marketing  People  Sales
Date                                          
2020-01            0          0       0   1000
2020-02            0       3200       0   1000
2020-03            0       3200       0   2200
2020-04         3500       3200       0   2200
2020-05         3500       3200    3000   1200
...              ...        ...     ...    ...
2099-08         3500       3200       0      0
2099-09         3500       3200       0      0
2099-10         3500       3200       0      0
2099-11         3500       3200       0      0
2099-12         3500       3200       0      0

[960 rows x 4 columns]

@nguyenanh yes of course, subtracting the 2 periods gives a number of periods, this code is to extract the integer part — mozway, Apr 12 '23 at 10:36

How to group by months from starting date, ending date and categorical variables

1 Answers1