Get non-overlapping periods from two DataFrames with date ranges

Question

I'm working on a billing system.

On the one hand, I have contracts with start and end dates, which I need to bill monthly. One contract can have several start/end date ranges, but they can't overlap for the same contract.

On the other hand, I have a df with the invoices billed per contract, with their start and end dates. The invoices' start/end dates for a given contract can't overlap either. There can be a gap, though, between the end date of one invoice and the start of the next.

My goal is to look at the contract start/end dates and remove all the periods already billed for each contract, so that I know what's left to bill.

Here is my data for contract:

contract_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770053'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2019-01-01 00:00:00'),
  2: pd.to_datetime('2019-07-01 00:00:00'),
  3: pd.to_datetime('2019-09-01 00:00:00'),
  4: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-01-01 00:00:00'),
  1: pd.to_datetime('2019-07-01 00:00:00'),
  2: pd.to_datetime('2019-09-01 00:00:00'),
  3: pd.to_datetime('2021-01-01 00:00:00'),
  4: pd.to_datetime('2024-01-01 00:00:00')}})

Here is my invoice data (no invoice for C00770053):

invoice_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770052',
  3: 'C00770052',
  4: 'C00770052',
  5: 'C00770052',
  6: 'C00770052',
  7: 'C00770052'},
 'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
  1: pd.to_datetime('2018-08-01 00:00:00'),
  2: pd.to_datetime('2018-09-01 00:00:00'),
  3: pd.to_datetime('2018-10-01 00:00:00'),
  4: pd.to_datetime('2018-11-01 00:00:00'),
  5: pd.to_datetime('2019-05-01 00:00:00'),
  6: pd.to_datetime('2019-06-01 00:00:00'),
  7: pd.to_datetime('2019-07-01 00:00:00')},
 'to': {0: pd.to_datetime('2018-08-01 00:00:00'),
  1: pd.to_datetime('2018-09-01 00:00:00'),
  2: pd.to_datetime('2018-10-01 00:00:00'),
  3: pd.to_datetime('2018-11-01 00:00:00'),
  4: pd.to_datetime('2019-04-01 00:00:00'),
  5: pd.to_datetime('2019-06-01 00:00:00'),
  6: pd.to_datetime('2019-07-01 00:00:00'),
  7: pd.to_datetime('2019-09-01 00:00:00')}})

My expected result is:

to_bill_df = pd.DataFrame({'contract_id': {0: 'C00770052',
  1: 'C00770052',
  2: 'C00770053'},
 'from': {0: pd.to_datetime('2019-04-01 00:00:00'),
  1: pd.to_datetime('2019-09-01 00:00:00'),
  2: pd.to_datetime('2019-10-01 00:00:00')},
 'to': {0: pd.to_datetime('2019-05-01 00:00:00'),
  1: pd.to_datetime('2021-01-01 00:00:00'),
  2: pd.to_datetime('2024-01-01 00:00:00')}})

What I need, therefore, is to go through each row of contract_df, identify the invoices matching the relevant period, and remove the periods that have already been billed, possibly splitting the contract_df row into 2 rows if there is a gap.

The problem is that doing it this way seems very heavy, considering that I'll have millions of invoices and contracts. I feel like there is an easy way with pandas, but I'm not sure how to do it.

Thanks

score 1 · Accepted Answer · answered Oct 15 '19 at 10:15

I was solving a similar problem the other day. It's not a simple solution, but it should be generic enough to identify any non-overlapping intervals.

The idea is to convert your dates into consecutive integers (day numbers), so that overlaps can be removed with the set union operator (|). The function below transforms your DataFrame into a dictionary that maps each ID to the set of integer days covered by its date ranges.

from functools import reduce

import pandas as pd

def non_overlapping_intervals(df, uid, date_from, date_to):
    # Convert each date to an integer day count relative to a fixed reference date
    reference = pd.Timestamp('1900-01-01')
    helper_from = date_from + '_helper'
    helper_to = date_to + '_helper'
    # Drop incomplete rows first so the day counts stay integers, and work on a
    # copy so the caller's DataFrame does not gain helper columns
    df = df.dropna(subset=[uid, date_from, date_to]).copy()
    df[helper_from] = df[date_from].sub(reference).dt.days
    df[helper_to] = df[date_to].sub(reference).dt.days

    out = (
        df.groupby(uid)
        [[helper_from, helper_to]]
        .apply(
            lambda x: reduce(  # Fold the union over an arbitrary number of rows
                lambda a, b: a | b, x.apply(  # Set union removes days that are covered twice
                    lambda y: set(range(y[helper_from], y[helper_to])),  # All days in [from, to)
                    axis=1
                )
            )
        )
        .to_dict()
    )
    return out
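
To make the day-set idea concrete, here is a minimal sketch on two made-up overlapping ranges. The helper name day_set and the dates are hypothetical and only serve as an illustration; the reference date matches the one used in the function above.

```python
import pandas as pd

REF = pd.Timestamp('1900-01-01')  # arbitrary reference date

def day_set(start, end):
    """Return the integer day offsets in the half-open range [start, end)."""
    first = (pd.Timestamp(start) - REF).days
    last = (pd.Timestamp(end) - REF).days
    return set(range(first, last))

# Two hypothetical ranges of 10 days each that share 5 days (Jan 6 to Jan 10).
a = day_set('2019-01-01', '2019-01-11')
b = day_set('2019-01-06', '2019-01-16')

# The union keeps each day once, so the shared days are not double-counted.
covered = a | b
print(len(a), len(b), len(covered))  # 10 10 15
```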

From here, we want to do a set subtraction to find the dates and IDs for which there are contracts but no invoices:

from collections import defaultdict

invoice_dates = defaultdict(set, non_overlapping_intervals(invoice_df, 'contract_id', 'from', 'to'))
contract_dates = defaultdict(set, non_overlapping_intervals(contract_df, 'contract_id', 'from', 'to'))

missing_dates = {}
for k, v in contract_dates.items():
    missing_dates[k] = list(v - invoice_dates.get(k, set()))

Now we have a dict called missing_dates that gives us each date for which there are no invoices. To convert it into your output format, we need to separate each continuous group for each ID. Using this answer, we arrive at the below:

from itertools import groupby
from operator import itemgetter

missing_invoices = []
for uid, dates in missing_dates.items():
    for k, g in groupby(enumerate(sorted(dates)), lambda x: x[0] - x[1]):
        group = list(map(int, map(itemgetter(1), g)))
        missing_invoices.append([uid, group[0], group[-1]])
missing_invoices = pd.DataFrame(missing_invoices, columns=['contract_id', 'from', 'to'])

# Convert back to datetime
missing_invoices['from'] = missing_invoices['from'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x))
missing_invoices['to'] = missing_invoices['to'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x + 1))
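
The grouping step relies on a small trick: in a sorted list of integers, value minus position is constant within a run of consecutive values. A minimal sketch on made-up day numbers (the list is hypothetical):

```python
from itertools import groupby

days = [3, 4, 5, 9, 10, 20]  # hypothetical sorted day numbers

runs = []
# enumerate() pairs each value with its position; the difference between the
# two stays the same for as long as the values increase by exactly 1.
for _, group in groupby(enumerate(days), key=lambda pair: pair[1] - pair[0]):
    run = [day for _, day in group]
    runs.append((run[0], run[-1]))  # first and last day of the run, inclusive

print(runs)  # [(3, 5), (9, 10), (20, 20)]
```

Because the last day of each run is inclusive, one day has to be added when converting it back to a half-open 'to' date, which is what the days=x + 1 above does.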

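Probably not the simple solution you were looking for, but it should be reasonably efficient. Keep in mind that it stores one integer per covered day, so memory use grows with the total length of the date ranges.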
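
If expanding every range into individual days becomes too heavy, one possible alternative is to subtract the intervals directly. This is a different technique from the day-set approach above, sketched under the assumption that all ranges are half-open [from, to). The function names (merge_intervals, subtract_intervals, unbilled_periods) and the sample data are hypothetical.

```python
import pandas as pd

def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def subtract_intervals(keep, remove):
    """Return the parts of `keep` not covered by `remove` (both merged, sorted)."""
    result = []
    for start, end in keep:
        cursor = start  # everything before the cursor is already handled
        for r_start, r_end in remove:
            if r_end <= cursor:
                continue  # removal lies entirely before the cursor
            if r_start >= end:
                break  # removal lies entirely after this interval
            if r_start > cursor:
                result.append((cursor, r_start))  # uncovered gap
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))  # uncovered tail
    return result

def unbilled_periods(contract_df, invoice_df, uid='contract_id'):
    billed = {key: merge_intervals(zip(grp['from'], grp['to']))
              for key, grp in invoice_df.groupby(uid)}
    rows = []
    for key, grp in contract_df.groupby(uid):
        contracts = merge_intervals(zip(grp['from'], grp['to']))
        for start, end in subtract_intervals(contracts, billed.get(key, [])):
            rows.append({uid: key, 'from': start, 'to': end})
    return pd.DataFrame(rows, columns=[uid, 'from', 'to'])

# Hypothetical sample data: contract A is partly invoiced, contract B not at all.
contracts = pd.DataFrame({
    'contract_id': ['A', 'A', 'B'],
    'from': pd.to_datetime(['2019-01-01', '2019-04-01', '2019-10-01']),
    'to': pd.to_datetime(['2019-04-01', '2019-07-01', '2020-01-01']),
})
invoices = pd.DataFrame({
    'contract_id': ['A', 'A'],
    'from': pd.to_datetime(['2019-01-01', '2019-05-01']),
    'to': pd.to_datetime(['2019-03-01', '2019-06-01']),
})
result = unbilled_periods(contracts, invoices)
print(result)
```

On this sample, contract A is left with 2019-03-01 to 2019-05-01 and 2019-06-01 to 2019-07-01, and contract B keeps its full range. The sketch still loops in Python per contract, so it has not been checked against millions of rows; its main appeal is that the work depends on the number of intervals rather than the number of days.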

That is a very clever way of doing it; it works fine and gives the expected results! I was considering doing something similar, with a df of all days to bill per contract_id, and removing the duplicates against a df of all days for all invoices, but it was taking too long. I was creating those daily dfs using something like: pd.concat([pd.DataFrame({'contract_id': row['contract_id'], 'from': pd.date_range(row['from'], row['to'], freq='1D', closed='left')}) for index, row in contract_df.iterrows()]). Just curious whether someone has found a clean and fast way of doing something like this with pandas. - yeye, Oct 15 '19 at 13:52
