Consolidate a dataframe based on conditions

Question

I have a dataframe df

    id                    email             firstname  lastname  salutation
    2be858a0458faa569d3d  user_a@gmail.com             Lastname
    2be858a0458faa569d3d  user_a@gmail.com  Firstname
    2be858a0458faa569d3d  user_a@gmail.com  Firstname            Mr

I want to have a consolidated dataframe df_consolidated

    2be858a0458faa569d3d  user_a@gmail.com  Firstname  Lastname  Mr

The logic should be that it takes the non-empty values from all rows of df and "sums" them up into one row.

Any idea?

Are the empty values NaN or empty strings? – Corralien, Jan 10 '23 at 15:59

score 1 · Answer 1 · answered Jan 10 '23 at 16:02

You can use groupby followed by first, which keeps the first non-null value of each column per group:

>>> df.groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  user_a@gmail.com  Firstname  Lastname         Mr

If the empty values are empty strings, you can replace '' with np.nan first:

>>> df.replace({'': np.nan}).groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  user_a@gmail.com  Firstname  Lastname         Mr
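
For reference, here is a minimal self-contained sketch of the same approach. The sample frame is rebuilt by hand from the question, and the mix of NaN and empty strings in the missing cells is an assumption, since the question does not say which one is used:

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question. Which cells are NaN and which are
# empty strings is an assumption made for illustration.
df = pd.DataFrame({
    "id": ["2be858a0458faa569d3d"] * 3,
    "email": ["user_a@gmail.com"] * 3,
    "firstname": ["", "Firstname", "Firstname"],
    "lastname": ["Lastname", np.nan, ""],
    "salutation": [np.nan, "", "Mr"],
})

# Turn empty strings into NaN so that first() skips them,
# then keep the first non-null value of every column per id.
df_consolidated = (
    df.replace({"": np.nan})
      .groupby("id", as_index=False)
      .first()
)

print(df_consolidated)
```

Note that first() keeps only one value per column, so if two rows of the same id hold different non-empty values, the later ones are silently dropped.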

Andreas · Answer 2 · 2023-01-10T16:28:04.807

You need some sort of identifier for what is considered the "same".

If all rows are identical and you just want one, you can use:

df.drop_duplicates()

or the answer of @Corralien.
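
A small sketch of what drop_duplicates does, using made-up data (the ids and emails below are hypothetical). It only collapses rows that are identical in every column, so it would not merge the partially filled rows from the question:

```python
import pandas as pd

# Hypothetical data: the first two rows are exact duplicates.
df = pd.DataFrame({
    "id": ["x1", "x1", "x2"],
    "email": ["a@example.com", "a@example.com", "b@example.com"],
})

# Keeps the first occurrence of each fully identical row.
deduped = df.drop_duplicates()

print(deduped)
```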

If all rows that should be aggregated share a specific trait, e.g. 'id', you can use:

df.groupby('id').agg(set)

This returns one row per id, with a set of the unique values in each column. (Use agg rather than apply here: apply(set) calls set on each group's DataFrame, which yields the column names instead of the values.) A set in Python is unordered, so if the order matters you can use the keys of a dictionary as a replacement, see here: Does Python have an ordered set?
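
As a rough sketch of the dictionary-keys idea: dict.fromkeys keeps the first occurrence of each value in insertion order (guaranteed since Python 3.7). The helper name ordered_unique and the sample data are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data with repeated values in a meaningful order.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b"],
    "firstname": ["Zoe", "Adam", "Zoe", "Eve"],
})

def ordered_unique(values):
    """Return the unique non-null values of a Series in order of first appearance."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(values.dropna()))

# One row per id; each cell holds an ordered list of unique values.
result = df.groupby("id").agg(ordered_unique)

print(result)
```

Unlike a set, the resulting lists keep "Zoe" before "Adam" for id "a", matching the order in which the values appear in df.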
