Here's a sample dataset I've created for this question:

data1 = pd.DataFrame([['1','303','3/7/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/7/2016'],
                     ['1','303','5/6/2017']],
                    columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])
data2 = pd.DataFrame([['1','303','2/5/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/17/2016'],
                     ['1','303','4/6/2017']],
                    columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
  code ticket #    CB date
0    1      303 2016-03-07
1    4      404 2011-06-23
2    7      101 2016-03-07
3    1      303 2017-05-06
  code ticket # audit date
0    1      303 2016-02-05
1    4      404 2011-06-23
2    7      101 2016-03-17
3    1      303 2017-04-06

I want to merge the two DataFrames, and make sure that the CB dates are always on or after the audit dates:

data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

However, I only want to keep the row with the earliest CB date for each audit date. So in the output above, the row at index 2 shouldn't be there: it and the row at index 0 share the same audit date, 2016-02-05, but I only want to keep index 0, since its CB date is much closer to 2016-02-05 than index 2's is.

Desired output:

code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

I know in SQL I'd have to group by code, ticket # and audit date first, then order CB date ascending, then take the row with rank = 1 in each group; but how can I do this in Python/pandas?
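That SQL plan (group by code, ticket # and audit date, order CB date ascending, take the first row of each group) maps directly onto pandas' `sort_values` followed by `groupby(...).first()`. A minimal, self-contained sketch using the sample frames above:

```python
import pandas as pd

data1 = pd.DataFrame([['1', '303', '3/7/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/7/2016'],
                      ['1', '303', '5/6/2017']],
                     columns=['code', 'ticket #', 'CB date'])
data1['CB date'] = pd.to_datetime(data1['CB date'])

data2 = pd.DataFrame([['1', '303', '2/5/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/17/2016'],
                      ['1', '303', '4/6/2017']],
                     columns=['code', 'ticket #', 'audit date'])
data2['audit date'] = pd.to_datetime(data2['audit date'])

# Merge, then keep only rows where the CB date is on or after the audit date
data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]

# Sort by CB date, then take the first (earliest) row per
# (code, ticket #, audit date) group -- the pandas analogue of rank = 1
result = (data_all.sort_values('CB date')
                  .groupby(['code', 'ticket #', 'audit date'], as_index=False)
                  .first())
```

This produces one row per audit date per code/ticket pair, with the earliest qualifying CB date.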

I read other posts here but I am still not getting it, so would really appreciate some advice here.

A few posts I read include:

  1. Pandas Groupy take only the first N Groups
  2. Pandas: select the first couple of rows in each group
alwaysaskingquestions

1 Answer


I'd do this with an optional sort_values call and a drop_duplicates call.

data_all.sort_values(data_all.columns.tolist()) \
        .drop_duplicates(subset=['CB date'], keep='first')
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
4    4      404 2011-06-23 2011-06-23

I say the sort_values call is optional here, since your data appears to be sorted already. If it isn't, make sure the sort is part of your solution.
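For completeness, the snippet above can be run end to end starting from the merged, filtered data_all shown in the question. Note that de-duplicating on 'CB date' alone relies on CB dates not repeating across different code/ticket groups; if that assumption doesn't hold for your real data, including the group keys in subset (e.g. ['code', 'ticket #', 'CB date']) is safer:

```python
import pandas as pd

# The merged and filtered frame from the question
data_all = pd.DataFrame(
    {'code': ['1', '1', '1', '4'],
     'ticket #': ['303', '303', '303', '404'],
     'CB date': pd.to_datetime(['2016-03-07', '2017-05-06',
                                '2017-05-06', '2011-06-23']),
     'audit date': pd.to_datetime(['2016-02-05', '2016-02-05',
                                   '2017-04-06', '2011-06-23'])})

# Sort hierarchically on every column, then keep the first row
# seen for each distinct CB date
deduped = (data_all.sort_values(data_all.columns.tolist())
                   .drop_duplicates(subset=['CB date'], keep='first'))
```

This reproduces the three-row output shown above.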

cs95
  • thanks so much for always helping me out! really appreciate your knowledge. however, I want to sort the 'CB date' within each 'audit date' group; I don't see your code using any 'audit date' input? if I just sort it, then it'll sort based on the entire list of 'CB date' regardless of which 'audit date' they correspond to, right? – alwaysaskingquestions Oct 26 '17 at 18:46
  • @alwaysaskingquestions `.sort_values(data_all.columns.tolist())` sorts hierarchically. First sorts all codes, then in each code group sorts all tickets, then in each ticket group sorts all CB dates, and then in each CB date group sorts all audits. Notice the argument that I pass to the function - the list of all column names. – cs95 Oct 26 '17 at 18:47
  • my actual data has a lot more columns than my sample shows, and I don't want to sort based on ALL columns, only the important few. so in that case, I would replace "data_all" with "data_all['colA','colB','colC']" in the .sort_values(data_all.columns.tolist()), correct? thanks for being patient with me! – alwaysaskingquestions Oct 26 '17 at 18:49
  • @alwaysaskingquestions Yes, that would do. Maybe passing just `['CB date', 'audit date']` should be enough - I was being a bit too paranoid with the initial solution. – cs95 Oct 26 '17 at 18:50