Merge pandas dataframes and remove duplicate rows based on condition

Question

I want to compare average revenue "in offer" vs average revenue "out of offer" for each SKU.

When I merge the below two dataframes on sku I get multiple rows for each entry because in second dataframe sku is not unique. For example every instance of sku = 1 will have two entries because test_offer contains 2 separate offers for sku 1. However there can only be one offer live for a SKU at any time, which should verify the condition:

 test_ga['day'] >= test_offer['start_day'] &  test_ga['day'] <= test_offer['end_day']

dataset 1

test_ga = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8}, 
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 3, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 3}, 
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34, 16: 71, 17: 69, 18: 90, 19: 93, 20: 43, 21: 45, 22: 57, 23: 89}} )

dataset 2

test_offer = pd.DataFrame( {'sku': {0: 1, 1: 1, 2: 2}, 
'offer_number': {0: 5, 1: 6, 2: 7}, 
'start_day': {0: 2, 1: 6, 2: 4}, 
'end_day': {0: 4, 1: 7, 2: 8}} )

Expected Output

expected_output = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8}, 
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2}, 
'offer': {0: float('nan'), 1: '5', 2: '5', 3: '5', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '7', 12: '7', 13: '7', 14: '7', 15: '7'},
'start_day': {0: float('nan'), 1: '2', 2: '2', 3: '2', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '4', 12: '4', 13: '4', 14: '4', 15: '4'}, 
'end_day': {0: float('nan'), 1: '4', 2: '4', 3: '4', 4: float('nan'), 5: '7', 6: '7', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '8', 12: '8', 13: '8', 14: '8', 15: '8'}, 
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34}} )

I did actually find a solution based on this SO answer, but it took me a while and the question is not really clear.

I thought it could still be useful to create this question even if I found a solution. Besides, there are probably better ways to achieve this that do not require to create a dummy variables and sorting the dataframe?

If this question is a duplicate let me know and I will cancel it.

What specific code did you use to produce `expected_output`? — Peter Leimbigler, Jul 28 '19 at 13:14
`pd.read_clipboard()` :) and `print ("test_ga = pd.DataFrame( %s )" % (str(test_ga.to_dict())))` — Giacomo, Jul 28 '19 at 13:17
so if I understand correctly, you want to merge the two data frames on `sku` but only want the ones where day is greater than the start date and less than the end date. do you have no way to get the actual date? you could merge on `sku` then `date` but the output is quite doable. — Umar.H, Jul 28 '19 at 13:23
@Giacomo, you mentioned you found a solution based on the linked answer. Could you post that solution? Might be good to have as a benchmark / starting point — Peter Leimbigler, Jul 28 '19 at 13:27
@Datanovice Ultimately what I want to do is to compare average revenue in offer vs average revenue out of offer for the same SKU. — Giacomo, Jul 28 '19 at 14:30

Giacomo · Accepted Answer · 2019-07-28T17:34:21.707

One possible solution:

test_data = pd.merge(test_ga, test_offer, on = 'sku')
# I define if every row is in offer or not.
test_data['is_offer'] = np.where((test_data['day'] >= test_data['start_day']) & (test_data['day'] <= test_data['end_day']), 1, 0)
expected_output = test_data.sort_values(['sku','day','is_offer']).groupby(['day', 'sku']).tail(1)

and then clean up the data adding Nan values for rows not in offer.

expected_output['start_day'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['start_day'])
expected_output['end_day'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['end_day'])
expected_output['offer_number'] = np.where(expected_output['is_offer'] == 0, np.NAN, expected_output['offer_number'])
expected_output

Merge pandas dataframes and remove duplicate rows based on condition

1 Answers1