I have a large dataset where I'm using an id number as the index. Before I start processing, I need to resolve a handful of duplicate ids, but the choice has to be made based on multiple criteria - essentially, I want to keep the best entry out of all the duplicate entries for each id.
In this example, I might first want to take the duplicate with the most recent year - that alone makes id 1 unique. But id 3 still has two rows left, so then I want to move on to the next condition - in this case, say, the maximum val:
import pandas as pd

data = {'id': [1, 1, 2, 3, 3, 3],
        'year': [2001, 2002, 2002, 2001, 2002, 2002],
        'val': [7, 8, 9, 17, 11, 12]}
df = pd.DataFrame(data).set_index('id')

# rows whose id appears more than once
dups = df[df.index.duplicated(keep=False)]

for i in dups.index.unique():
    current = dups.loc[i]
    max_yr = current['year'].max()
    # continue to process using multiple conditionals inside the loop
So this:
    val  year
id
1     7  2001
1     8  2002
2     9  2002
3    17  2001
3    11  2002
3    12  2002
turns into this:
    val  year
id
1     8  2002
2     9  2002
3    12  2002
It works, but it's very inefficient, and I feel like there must be a vectorized - or at least better - way to accomplish this. A groupby could be incorporated, but I still don't see how to do it without just looping through the groups.
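The closest I've gotten is the sketch below (only tested against this toy frame, and it assumes every tie-breaker can be expressed as a plain sort key): sort so the preferred row of each id ends up last, then drop every occurrence of an id except the last. Is this the right direction?

# Possible vectorized version: sort so the "best" row per id comes last
# (earliest-to-latest year, then lowest-to-highest val), then keep only
# the last occurrence of each id in that order.
best = df.sort_values(['year', 'val'])
best = best[~best.index.duplicated(keep='last')].sort_index()
print(best)

That drops the Python-level loop entirely, but I'm not sure it holds up once the conditions get more complicated than ascending/descending sorts.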