Remove duplication based on date conditions

Question

I had a dataframe like below, and I would like to remove duplications based on certain criteria. 1) If the startdate is greater than Month, it will be removed. 2) If the startdate is less than Month, keep the latest record.

>       COMP    Month       Startdate   bundle            result
> 0     TD3M    2018-03-01  2015-08-28  01_Essential      keep    
> 1     TD3M    2018-03-01  2018-07-17  04_Complete       remove
> 2     TD3M    2018-04-01  2015-08-28  01_Essential      keep
> 3     TD3M    2018-04-01  2018-07-17  04_Complete       remove
> 4     TD3M    2018-05-01  2015-08-28  01_Essential      keep
> 5     TD3M    2018-05-01  2018-07-17  04_Complete       remove
> 6     TD3M    2018-06-01  2015-08-28  01_Essential      keep
> 7     TD3M    2018-06-01  2018-07-17  04_Complete       remove
> 8     TD3M    2018-08-01  2015-08-28  01_Essential      remove
> 9     TD3M    2018-08-01  2018-07-17  04_Complete       keep
> 10    TD3M    2018-09-01  2015-08-28  01_Essential      remove
> 11    TD3M    2018-09-01  2018-07-17  04_Complete       keep

The expected output would be:

>       COMP    Month       Startdate   bundle            
> 0     TD3M    2018-03-01  2015-08-28  01_Essential      
> 2     TD3M    2018-04-01  2015-08-28  01_Essential     
> 4     TD3M    2018-05-01  2015-08-28  01_Essential     
> 6     TD3M    2018-06-01  2015-08-28  01_Essential     
> 9     TD3M    2018-08-01  2018-07-17  04_Complete  
> 11    TD3M    2018-09-01  2018-07-17  04_Complete

Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) — Erfan, Jul 18 '19 at 15:44

score 1 · Accepted Answer · answered Jul 18 '19 at 16:11

First of all, I drop your column 'result':

df = df.drop(columns='result')

First check that your Month and Startdate fields are in datetime format:

df.Month = pd.to_datetime(df.Month) df.Startdate = pd.to_datetime(df.Startdate)

Then filter and groupby (agg by max) :

df = df[df.Startdate <= df.Month] df.groupby(['COMP', 'Month'], as_index=False).max()

score 0 · Answer 2 · answered Jul 18 '19 at 16:13

Here is one way of using sort_values drop_duplicates

df.query('Startdate<=Month').sort_values('Startdate').drop_duplicates('Month',keep='last')
Out[892]: 
    COMP      Month  Startdate        bundle result
0   TD3M 2018-03-01 2015-08-28  01_Essential   keep
2   TD3M 2018-04-01 2015-08-28  01_Essential   keep
4   TD3M 2018-05-01 2015-08-28  01_Essential   keep
6   TD3M 2018-06-01 2015-08-28  01_Essential   keep
9   TD3M 2018-08-01 2018-07-17   04_Complete   keep
11  TD3M 2018-09-01 2018-07-17   04_Complete   keep

Remove duplication based on date conditions

2 Answers2