5

I have a dataframe of customers with records for shipments they received. Unfortunately, these can overlap. I'm trying to reduce rows so that I can see dates of consecutive use. Is there any way to do this besides a brute force iterrows implementation?

Here's a sample and what I'd like to do:

df = pd.DataFrame([['A','2011-02-07','2011-02-22',1],['A','2011-02-14','2011-03-10',2],['A','2011-03-07','2011-03-15',3],['A','2011-03-18','2011-03-25',4]], columns = ['Cust','startDate','endDate','shipNo'])
df

  Cust   startDate     endDate  shipNo
0    A  2011-02-07  2011-02-22       1
1    A  2011-02-14  2011-03-10       2
2    A  2011-03-07  2011-03-15       3
3    A  2011-03-18  2011-03-25       4

condensedDf = df.groupby(['Cust']).apply(reductionFunction)
condensedDf

(image of the desired condensed dataframe: one row covering 2011-02-07 to 2011-03-15 and one row for 2011-03-18 to 2011-03-25)

The reductionFunction will group the first three records into one, because in each case the start date of the next record falls before the end date of the prior one. I'm essentially turning multiple overlapping records into a single record.

Thoughts on a good "pythonic" implementation? I could do a nasty while loop within each group, but I'd prefer not to...

flyingmeatball
  • This question is not identical, but is similar: http://stackoverflow.com/questions/21367485/merge-pandas-dataframes-based-on-irregular-time-intervals Answers there might be relevant. – jakevdp Oct 21 '15 at 16:45

3 Answers

14

Fundamentally, I think this is a graph connectivity problem: a fast way of solving it will be some manner of graph connectivity algorithm. Pandas doesn't include such tools, but scipy does. You can use the compressed sparse graph (csgraph) submodule in scipy to solve your problem like this:

from scipy.sparse.csgraph import connected_components

# convert to datetime, so min() and max() work
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': 'first'})

df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')

  Cust  startDate    endDate  shipNo
0    A 2011-02-07 2011-03-15       1
1    A 2011-03-18 2011-03-25       4

If you want to do something different with shipNo from here, it should be pretty straightforward.
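For example, here is a minimal sketch of one alternative (my own assumption, not part of the original answer; reductionFunctionKeepAll is a hypothetical name): keep every shipment number of a merged range as a comma-separated string instead of taking the first:

from scipy.sparse.csgraph import connected_components

def reductionFunctionKeepAll(data):
    # same overlap graph and connected components as above
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])
    n_components, indices = connected_components(graph)

    # join every shipNo in a component into one string instead of keeping the first
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': lambda s: ', '.join(s.astype(str))})

Called the same way as above, df.groupby(['Cust']).apply(reductionFunctionKeepAll).reset_index('Cust') should then give '1, 2, 3' for the merged range and '4' for the last one.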

Note that the connected_components() function above is not brute force, but uses a fast algorithm to find the connections.
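As an illustrative check (my own addition, not part of the original answer), running the same overlap test on just the four sample date ranges for customer A shows the first three ranges ending up in one component and the last range in another:

import pandas as pd
from scipy.sparse.csgraph import connected_components

start = pd.to_datetime(['2011-02-07', '2011-02-14', '2011-03-07', '2011-03-18']).values
end = pd.to_datetime(['2011-02-22', '2011-03-10', '2011-03-15', '2011-03-25']).values

# pairwise overlap matrix: True where range i intersects range j
graph = (start <= end[:, None]) & (end >= start[:, None])

n_components, indices = connected_components(graph)
print(n_components)  # 2
print(indices)       # [0 0 0 1] -> rows 0-2 merge into one range, row 3 stays alone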

jakevdp
  • thanks for the input! I hadn't considered this approach. Your solution works for the sample data, which only had one customer listed. However, it doesn't work for my real-life example where there are multiple Cust values. modifying the df so that it includes a "B" cust, like so df.ix[4] = ['B','2011-02-07','2011-02-22',1] breaks the solution. Any thoughts on how to make it slightly more generalizable to multiple custs? I'm not familiar with that particular scipy graph function. – flyingmeatball Oct 21 '15 at 17:47
  • The solution works for me after adding a second customer... what error are you getting? – jakevdp Oct 21 '15 at 17:49
  • ValueError: data, indices, and indptr should be 1-D – flyingmeatball Oct 21 '15 at 17:53
  • I don't see that error on Pandas 0.17 and Scipy 0.16. What versions are you using? – jakevdp Oct 21 '15 at 17:58
  • Possibly related to this bug if you're using SciPy 0.14 or older: https://github.com/scipy/scipy/issues/3818 – jakevdp Oct 21 '15 at 18:02
  • Yup - had SciPy 0.14, upgrading now. Appreciate the assistance! – flyingmeatball Oct 21 '15 at 18:03
  • @jakevdp Hi Jake, do you have any suggestions how to speed things up when having > 100000 lines and/or groups? This solution seems to work for smaller groups, but is too slow for larger dataframes. – Sander van den Oord May 25 '21 at 13:53
  • @SandervandenOord I had the same issue and solved it; I posted an [answer](https://stackoverflow.com/a/68842525/7794309) which is a bit faster. – Mahdi Abadinia Aug 19 '21 at 05:38
1

The approach below is a bit messy, but it does the job faster.

Thanks to Anurag Dabas for the help.

I merge the dataframe with itself and check for overlaps, then remove the extra rows, and repeat until no overlaps are left.

Please note that I keep all the "shipNo" values (concatenated) rather than just the first.

import pandas as pd
import numpy as np
df = pd.DataFrame([['A','2011-02-07','2011-02-22',1],['A','2011-02-14','2011-03-10',2],['A','2011-03-07','2011-03-15',3],['A','2011-03-18','2011-03-25',4]], columns = ['Cust','startDate','endDate','shipNo'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

def overlap_checker(data):
    # unique key per row so that a row is not treated as overlapping with itself
    data['CuststartDateendDate'] = data['Cust'].map(str) + data['startDate'].map(str) + data['endDate'].map(str)
    df2 = pd.merge(data, data, on='Cust')
    df2['Overlap'] = np.where((df2['startDate_x'] <= df2['endDate_y']) &
                              (df2['endDate_x'] >= df2['startDate_y']) &
                              (df2['CuststartDateendDate_x'] != df2['CuststartDateendDate_y']),
                              'Overlapped', 'not overlapped')
    # widen overlapping pairs to the earliest start and the latest end
    df2['startDate_x'] = np.where(df2['Overlap'].eq('Overlapped'), df2[['startDate_x', 'startDate_y']].min(axis=1), df2['startDate_x'])
    df2['endDate_x'] = np.where(df2['Overlap'].eq('Overlapped'), df2[['endDate_x', 'endDate_y']].max(axis=1), df2['endDate_x'])
    # concatenate both shipNo values, then deduplicate the digits
    df2['shipNo'] = df2['shipNo_x'].map(str) + df2['shipNo_y'].map(str)
    df2['shipNo'] = df2['shipNo'].apply(lambda x: ' '.join(sorted(set(x))))
    df2.rename(columns={'startDate_x': 'startDate', 'endDate_x': 'endDate'}, inplace=True)
    return df2, data

def overlap_remover(df, data):
    df2 = df[(df['Overlap'] == "Overlapped")]
    # keep the rows that were not involved in any overlap
    data1 = data[~data['CuststartDateendDate'].isin(df2['CuststartDateendDate_x'])]
    df2 = df2.drop(columns=['startDate_y', 'endDate_y', 'Overlap', 'CuststartDateendDate_x',
                            'CuststartDateendDate_y', 'shipNo_x', 'shipNo_y'])
    df2 = df2.drop_duplicates()
    bigdata = pd.concat([data1, df2], ignore_index=True, sort=False)
    return bigdata


dftmp, data = overlap_checker(df)
while dftmp['Overlap'].str.contains('Overlapped').any():
    df = overlap_remover(dftmp, data)
    dftmp, data = overlap_checker(df)

df = df.drop(columns=['CuststartDateendDate'])
df = df[['Cust', 'startDate', 'endDate', 'shipNo']]
print(df)

0

If you are open to using an auxiliary data frame to hold the result, you can honestly just loop through all the rows:

from time import strptime

import pandas as pd

results = [df.iloc[0]]

for i, (_, current_row) in enumerate(df.iterrows()):
    try:
        next_row = df.iloc[i + 1]
        # no overlap with the next row: close the current merged record and start a new one
        if strptime(current_row['endDate'], '%Y-%m-%d') < strptime(next_row['startDate'], '%Y-%m-%d'):
            results[-1]['endDate'] = current_row['endDate']
            results.append(next_row)
    except IndexError:
        pass

print(pd.DataFrame(results).reset_index(drop=True))
Lim H.
  • thanks for the input, I know I could brute force with an iterrows() call across all rows, but I'm fighting doing that like the plague. The result will be called in real time and needs to have some semblance of responsiveness. I already have an iterrows() implementation, and it is slower than I would like. – flyingmeatball Oct 21 '15 at 17:50
  • @flyingmeatball cheers, I'm just gonna leave this here for my own reference – Lim H. Oct 21 '15 at 17:51