Combine Dataframe rows on conditions

Question

I have a pandas-dataframe that looks like:

INPUT - here the example runnable code to create the INPUT:

#Create Dataframe with example data
df_example = pd.DataFrame(columns=["START_D","ID_1", "ID_2", "STOP_D"])
df_example["START_D"] = ['2014-06-16', '2014-06-01', '2016-05-01','2014-05-28', '2014-05-20', '2015-09-01']  
df_example['ID_1'] = [1,2,3,2,1,1]
df_example['ID_2'] = ['a', 'a', 'b', 'b', 'a', 'a']
df_example["STOP_D"] = ['2014-07-28', '2014-07-01', '2016-06-01', '2014-08-01', '2014-07-29', '2015-10-01']  

#Convert to datetime
df_example["START_D"] = pd.to_datetime(df_example["START_D"])
df_example["STOP_D"] = pd.to_datetime(df_example["STOP_D"])
df_example

 START_D  ID_1 ID_2     STOP_D
 0 2014-06-16     1    a 2014-07-28
 1 2014-06-01     2    a 2014-07-01
 2 2016-05-01     3    b 2016-06-01
 3 2014-05-28     2    b 2014-08-01
 4 2014-05-20     1    a 2014-07-29
 5 2015-09-01     1    a 2015-10-01

and I am looking for a way to group by ID_1 and merge the rows where the START_D and STOP_D overlaps. The start_d will be the smallest and the stop_d the greatest. Below you can see the desired output that I get looping over all rows (iterrows) and checking one element at time.

OUTPUT Even if this approach works I think it is slow (for large DF) and I think there must be a more pythonic-pandas way to do that.

>>> df_result
     START_D    ID_1     STOP_D
  0 2014-05-20     1 2014-07-29
  1 2014-05-28     2 2014-08-01
  2 2016-05-01     3 2016-06-01
  3 2015-09-01     1 2015-10-01

thanks!

Please check [How to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — jezrael, Oct 24 '16 at 12:14

score 1 · Accepted Answer · answered Oct 24 '16 at 15:01

sort_values
groupby('ID_1')
track STOP_D.cummax() and see if START_D is less than prior cummax
cumsum to generate groupings
agg to grab min START_D and max STOP_D

df_example = df.sort_values(['START_D', 'STOP_D'])

def collapse(df):
    s, e = 'START_D', 'STOP_D'
    grps = df[s].gt(df[e].cummax().shift()).cumsum()
    funcs = {s: 'min', e: 'max', 'ID_1': 'first'}
    return df.groupby(grps).agg(funcs)

df_example.groupby('ID_1').apply(collapse).reset_index(drop=True)

score 0 · Answer 2 · answered Oct 24 '16 at 12:41

The difficulty in your problem is that the aggregation needs to result in a single entry. So if there are non-overlapping START_D and STOP_D, but the ID1 is the same, no aggregation (even custom made) will work. I recommend the following steps:

Loop through each ID and ensure that the desired overlap is always occurring. This may be able to be vectorized with some witty coding. In cases where a conflict is found, generate a new ID (using a new column like ID3) to resolve the conflict. Otherwise just put ID1 into ID3 if no conflict exists.

Do a groupby using ID3 (or whatever you chose to call it)

df_result = df_example.groupby(['ID1']).agg({START_D: min, STOP_D: max})

The key to the performance boost is coming up with a vectorized solution to checking for start and stop conflict. Good luck! Hope this helps!

Combine Dataframe rows on conditions

2 Answers2