Pandas: How to remove rows with duplicate compound keys, while keeping missing values distributed among the duplicates?

Question

Desired outcome

I have a table of data that looks like this:

And I want to transform that table to look like this:

Problem description

The ID and Event # columns form a compound key that identifies one unique entry in the table.

Entries can be duplicated two or more times, and some of the row values are distributed among the duplicates. I don't always know whether those values are in the "first", "last", or some "middle" duplicate.

I want to remove the duplicate entries, while keeping all the populated row values, regardless of where they're distributed amongst the duplicates.

How can I do this with Pandas?

Looking at some SO posts I think I need to use groupby and fillna or ffill/bfill. But I'm new to Pandas and don't understand how I can make that work under these conditions:

Rows are distinguished with a compound key
There are instances where there's more than 1 duplicate row
There's valid data in more than 1 field distributed across those duplicates
I don't always know if the valid row data is located in the "first", "last", or some "middle" duplicate

Here's the dataframe:

import pandas as pd

df = pd.DataFrame([['ABC111',   1,  '1/1/23 12:00:00',  None,               '1/1/23 13:30:00',  None],
    ['ABC111',      2,  '1/2/23 00:00:00',  None,               '1/2/23 13:30:00',  None], 
    ['ABC111',      3,  '1/3/23 00:00:00',  None,               '1/3/23 13:30:00',  None], 
    ['ABC112',      1,  '1/1/23 00:00:00',  None,               '1/1/23 13:30:00',  None], 
    ['ABC112',      2,  '1/2/23 00:00:00',  'Test Value A',     None,               None], 
    ['ABC112',      2,  '1/2/23 00:00:00',  'Test Value A',     None,               None], 
    ['ABC112',      2,  None,               None,               '1/2/23 13:30:00',  'Test Value B'], 
    ['ABC113',      1,  '1/1/23 00:00:00',  None,               '1/1/23 13:30:00',  None], 
    ['ABC113',      2,  '1/2/23 00:00:00',  None,               '1/2/23 13:30:00',  None], 
    ['ABC113',      3,  None,               None,               '1/3/23 13:30:00',  'Test Value B'], 
    ['ABC113',      3,  '1/3/23 00:00:00',  'Test Value A',     None,               None], 
    ['ABC114',      1,  '1/1/23 00:00:00',  'Test Value A',     None,               None], 
    ['ABC114',      1,  None,               None,               '1/1/23 13:30:00',  'Test Value B'], 
    ['ABC114',      1,  None,               None,               '1/1/23 13:30:00',  'Test Value B'], 
    ['ABC114',      1,  None,               None,               '1/1/23 13:30:00',  'Test Value B'], 
    ['ABC114',      1,  None,               None,               '1/1/23 13:30:00',  'Test Value B'], 
    ['ABC114',      2,  '1/2/23 00:00:00',  None,               '1/2/23 13:30:00',  None], 
    ['ABC114',      3,  '1/3/23 00:00:00',  None,               '1/3/23 13:30:00',  None]],
    columns=['ID', 'Event #', 'Start Date', 'Start Value', 'End Date', 'End Value'])

This SO post is the closest potential solution I could find: Pandas: filling missing values by mean in each group

score 2 · Accepted Answer · answered Feb 16 '23 at 20:35

It looks like you want a groupby.first, which returns the first non-null value of each column within each group:

out = df.groupby(['ID', 'Event #'], as_index=False).first()

Output:

        ID  Event #       Start Date   Start Value         End Date     End Value
0   ABC111        1  1/1/23 12:00:00          None  1/1/23 13:30:00          None
1   ABC111        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
2   ABC111        3  1/3/23 00:00:00          None  1/3/23 13:30:00          None
3   ABC112        1  1/1/23 00:00:00          None  1/1/23 13:30:00          None
4   ABC112        2  1/2/23 00:00:00  Test Value A  1/2/23 13:30:00  Test Value B
5   ABC113        1  1/1/23 00:00:00          None  1/1/23 13:30:00          None
6   ABC113        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
7   ABC113        3  1/3/23 00:00:00  Test Value A  1/3/23 13:30:00  Test Value B
8   ABC114        1  1/1/23 00:00:00  Test Value A  1/1/23 13:30:00  Test Value B
9   ABC114        2  1/2/23 00:00:00          None  1/2/23 13:30:00          None
10  ABC114        3  1/3/23 00:00:00          None  1/3/23 13:30:00          None
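
To see why the position of the populated values does not matter, here is a minimal sketch on a made-up miniature of the data. The frame `small` and its values are hypothetical, not taken from the question; only the column names and the groupby call mirror the code above.

```python
import pandas as pd

# Hypothetical miniature of the question's data: the key ("K1", 1) is split
# across three rows, and each populated value sits in a different row.
small = pd.DataFrame(
    [
        ["K1", 1, None, None],
        ["K1", 1, "start", None],
        ["K1", 1, None, "end"],
        ["K2", 1, "start2", "end2"],
    ],
    columns=["ID", "Event #", "Start Value", "End Value"],
)

# first() picks, per group and per column, the first non-null value,
# so it does not matter which duplicate holds the data.
merged = small.groupby(["ID", "Event #"], as_index=False).first()
print(merged)
```

Two things to keep in mind: if duplicates hold different non-null values in the same column, only the one that comes first in row order is kept, and the result is sorted by the key columns by default (pass sort=False to groupby to keep the order in which keys first appear).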

That solved the problem, thank you! Much easier than I expected. — Nickolas Peter O'Malley, Feb 16 '23 at 20:56
