Here's a sample dataset I've created for this question:

data1 = pd.DataFrame([['1','303','3/7/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/7/2016'],
                     ['1','303','5/6/2017']],
                    columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])
data2 = pd.DataFrame([['1','303','2/5/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/17/2016'],
                     ['1','303','4/6/2017']],
                    columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
  code ticket #    CB date
0    1      303 2016-03-07
1    4      404 2011-06-23
2    7      101 2016-03-07
3    1      303 2017-05-06
  code ticket # audit date
0    1      303 2016-02-05
1    4      404 2011-06-23
2    7      101 2016-03-17
3    1      303 2017-04-06

I want to merge the two DataFrames, and make sure that the CB dates are always on or after the audit dates:

data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

However, I only want to keep the row with the earliest CB date for each audit date. So in the output above, the row at index 2 shouldn't be there: it and the row at index 0 share the same audit date, 2016-02-05, but I only want to keep index 0, since its CB date is much closer to 2016-02-05 than index 2's is.

Desired output:

code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

I know in SQL I'd have to group by code, ticket # and audit date first, then order CB date ascending, then take the row with rank = 1 in each group; but how can I do this in Python/pandas?
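That SQL plan (group by code, ticket # and audit date, order CB date ascending, take the first row of each group) maps directly onto pandas' `sort_values` followed by `groupby(...).first()`. A minimal, self-contained sketch using the sample frames above:

```python
import pandas as pd

data1 = pd.DataFrame([['1', '303', '3/7/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/7/2016'],
                      ['1', '303', '5/6/2017']],
                     columns=['code', 'ticket #', 'CB date'])
data1['CB date'] = pd.to_datetime(data1['CB date'])

data2 = pd.DataFrame([['1', '303', '2/5/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/17/2016'],
                      ['1', '303', '4/6/2017']],
                     columns=['code', 'ticket #', 'audit date'])
data2['audit date'] = pd.to_datetime(data2['audit date'])

# Merge, then keep only rows where the CB date is on or after the audit date
data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]

# Sort by CB date, then take the first (earliest) row per
# (code, ticket #, audit date) group -- the pandas analogue of rank = 1
result = (data_all.sort_values('CB date')
                  .groupby(['code', 'ticket #', 'audit date'], as_index=False)
                  .first())
```

This produces one row per audit date per code/ticket pair, with the earliest qualifying CB date.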

I read other posts here but I am still not getting it, so would really appreciate some advice here.

A few posts I read include:

  1. Pandas Groupy take only the first N Groups
  2. Pandas: select the first couple of rows in each group
alwaysaskingquestions

1 Answer


I'd do this with an optional sort_values call and a drop_duplicates call.

data_all.sort_values(data_all.columns.tolist()) \
        .drop_duplicates(subset=['CB date'], keep='first')
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
4    4      404 2011-06-23 2011-06-23

I say the sort_values call is optional here, since your data appears to be sorted already. If it isn't, make sure the sort is part of your solution.
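For completeness, the snippet above can be run end to end starting from the merged, filtered data_all shown in the question. Note that de-duplicating on 'CB date' alone relies on CB dates not repeating across different code/ticket groups; if that assumption doesn't hold for your real data, including the group keys in subset (e.g. ['code', 'ticket #', 'CB date']) is safer:

```python
import pandas as pd

# The merged and filtered frame from the question
data_all = pd.DataFrame(
    {'code': ['1', '1', '1', '4'],
     'ticket #': ['303', '303', '303', '404'],
     'CB date': pd.to_datetime(['2016-03-07', '2017-05-06',
                                '2017-05-06', '2011-06-23']),
     'audit date': pd.to_datetime(['2016-02-05', '2016-02-05',
                                   '2017-04-06', '2011-06-23'])})

# Sort hierarchically on every column, then keep the first row
# seen for each distinct CB date
deduped = (data_all.sort_values(data_all.columns.tolist())
                   .drop_duplicates(subset=['CB date'], keep='first'))
```

This reproduces the three-row output shown above.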

cs95
  • thanks so much for always helping me out! really appreciate your knowledge. however, I want to sort the 'CB date' within each 'audit date' group; I don't see your code using any 'audit date' input? if I just sort it, then it'll sort based on the entire list of 'CB date' regardless of which 'audit date' they correspond to, right? – alwaysaskingquestions Oct 26 '17 at 18:46
  • @alwaysaskingquestions `.sort_values(data_all.columns.tolist())` sorts hierarchically. First sorts all codes, then in each code group sorts all tickets, then in each ticket group sorts all CB dates, and then in each CB date group sorts all audits. Notice the argument that I pass to the function - the list of all column names. – cs95 Oct 26 '17 at 18:47
  • my actual data has a lot more columns than my sample shows, and I don't want to sort based on ALL columns, only the important few. so in that case, I would replace "data_all" with "data_all['colA','colB','colC']" in the .sort_values(data_all.columns.tolist()), correct? thanks for being patient with me! – alwaysaskingquestions Oct 26 '17 at 18:49
  • @alwaysaskingquestions Yes, that would do. Maybe passing just `['CB date', 'audit date']` should be enough - I was being a bit too paranoid with the initial solution. – cs95 Oct 26 '17 at 18:50