I'm working on a billing system.
On the one hand, I have contracts with start and end date, which I need to bill monthly. One contract can have several start/end dates, but they can't overlap for a same contract.
On the other hand, I have a df with the invoice billed per contract, with their start and end date. Invoices' start/end dates for a specific contract can't also overlap. There could be gap though between end date of an invoice and start of another invoice.
My goal is to look at the contract start/end dates, and remove all the period billed for a single contract, so that I know what's left to be billed.
Here is my data for contract:
contract_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770052',
3: 'C00770052',
4: 'C00770053'},
'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
1: pd.to_datetime('2019-01-01 00:00:00'),
2: pd.to_datetime('2019-07-01 00:00:00'),
3: pd.to_datetime('2019-09-01 00:00:00'),
4: pd.to_datetime('2019-10-01 00:00:00')},
'to': {0: pd.to_datetime('2019-01-01 00:00:00'),
1: pd.to_datetime('2019-07-01 00:00:00'),
2: pd.to_datetime('2019-09-01 00:00:00'),
3: pd.to_datetime('2021-01-01 00:00:00'),
4: pd.to_datetime('2024-01-01 00:00:00')}})
Here is my invoice data (no invoice for C00770053):
invoice_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770052',
3: 'C00770052',
4: 'C00770052',
5: 'C00770052',
6: 'C00770052',
7: 'C00770052'},
'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
1: pd.to_datetime('2018-08-01 00:00:00'),
2: pd.to_datetime('2018-09-01 00:00:00'),
3: pd.to_datetime('2018-10-01 00:00:00'),
4: pd.to_datetime('2018-11-01 00:00:00'),
5: pd.to_datetime('2019-05-01 00:00:00'),
6: pd.to_datetime('2019-06-01 00:00:00'),
7: pd.to_datetime('2019-07-01 00:00:00')},
'to': {0: pd.to_datetime('2018-08-01 00:00:00'),
1: pd.to_datetime('2018-09-01 00:00:00'),
2: pd.to_datetime('2018-10-01 00:00:00'),
3: pd.to_datetime('2018-11-01 00:00:00'),
4: pd.to_datetime('2019-04-01 00:00:00'),
5: pd.to_datetime('2019-06-01 00:00:00'),
6: pd.to_datetime('2019-07-01 00:00:00'),
7: pd.to_datetime('2019-09-01 00:00:00')}})
My expected result is:
to_bill_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770053'},
'from': {0: pd.to_datetime('2019-04-01 00:00:00'),
1: pd.to_datetime('2019-09-01 00:00:00'),
2: pd.to_datetime('2019-10-01 00:00:00')},
'to': {0: pd.to_datetime('2019-05-01 00:00:00'),
1: pd.to_datetime('2021-01-01 00:00:00'),
2: pd.to_datetime('2024-01-01 00:00:00')}})
What I need therefore is to go through each row of contract_df, identify the invoices matching the relevant period and remove the periods which have already been billed from the contract_df, eventually splitting the contract_df row into 2 rows if there is a gap.
The problem is that going like this seem very heavy considering that I'll have millions of invoices and contracts, I feel like there is an easy way with pandas but I'm not sure how I could do it
Thanks