1

I have a large dataset where I'm using an id number as the index. Before I start processing, I need to deal with a handful of duplicate ids that show up, but it needs to be done based on multiple criteria - essentially I want the best entry out of all the duplicate entries for each id.

In this example, I might first want it to take the duplicate with the most recent year - that will make id 1 unique. But id 3 still has two values left, so then I want it to go onto the next condition - in this case we'll say maximum val:

import pandas as pd
data = {'id':[1, 1, 2, 3, 3, 3], 'year':[2001, 2002, 2002, 2001, 2002, 2002], 'val':[7, 8, 9, 17, 11, 12]}
df = pd.DataFrame(data).set_index('id')

dups = df[df.index.duplicated(keep=False)]

for id in dups.index.unique():
    current = dups.loc[id]
    max_yr = current['year'].max()
    #continue to process using multiple conditionals inside the loop

Where the result is this:

    val  year
id
1     7  2001
1     8  2002
2     9  2002
3    17  2001
3    11  2002
3    12  2002

Turns to this:

    val  year
id
1     8  2002
2     9  2002
3    12  2002

It works, but it's very inefficient and I feel like there must be a vectorized or at least better way to accomplish this. A groupby could be incorporated, but I'm still not sure how I would do this without just looping through the groups.

Jeff
  • 2,158
  • 1
  • 16
  • 29

1 Answers1

1

The idea is to add a compound column which will solely determine the best match. For example, if the first criterion is max(year), and the second is max(val), you can combine them into a tuple and take the largest value of that column. If some criterion is min(X), then you can add -X to the tuple. (tuples are compared element by element)

import pandas as pd
data = {'id':[1, 1, 2, 3, 3, 3], 'year':[2001, 2002, 2002, 2001, 2002, 2002], 'val':[7, 8, 9, 17, 11, 12]}
df = pd.DataFrame(data).set_index('id')
df['year_val'] = df.apply(lambda row: (row.year, row.val), axis=1)
df.reset_index().groupby('id').apply(
    lambda df: df[df['year_val'] == df['year_val'].max()]).drop(
    ['year_val'], axis=1).set_index('id')

    val  year
id           
1     8  2002
2     9  2002
3    12  2002        
Community
  • 1
  • 1
Dennis Golomazov
  • 16,269
  • 5
  • 73
  • 81