I have a dataframe df:

    id                    email             firstname  lastname  salutation
    2be858a0458faa569d3d  user_a@gmail.com             Lastname
    2be858a0458faa569d3d  user_a@gmail.com  Firstname
    2be858a0458faa569d3d  user_a@gmail.com  Firstname            Mr

I want to have a consolidated dataframe df_consolidated:

    2be858a0458faa569d3d    user_a@gmail.com    Firstname   Lastname Mr

The logic should be that it takes all values from df and "sums" them up to one row.

Any idea?

2 Answers

You can use GroupBy.first, which returns the first non-null value of each column within each group:

>>> df.groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  user_a@gmail.com  Firstname  Lastname         Mr

If the missing values are empty strings rather than NaN, replace '' with np.nan first so that first() skips them:

>>> df.replace({'': np.nan}).groupby('id', as_index=False).first()

                     id             email  firstname  lastname salutation
0  2be858a0458faa569d3d  user_a@gmail.com  Firstname  Lastname         Mr
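
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the frame from the question and assumes the blank cells are empty strings, which the question does not state; the literal values are copied from the question's table.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame. Assumption: blank cells are empty strings.
df = pd.DataFrame({
    "id": ["2be858a0458faa569d3d"] * 3,
    "email": ["user_a@gmail.com"] * 3,
    "firstname": ["", "Firstname", "Firstname"],
    "lastname": ["Lastname", "", ""],
    "salutation": ["", "", "Mr"],
})

# Turn '' into NaN so first() can skip it, then take the first
# non-null value of every column within each id.
df_consolidated = (
    df.replace({"": np.nan})
      .groupby("id", as_index=False)
      .first()
)
print(df_consolidated)
```

If two rows of the same id hold different non-empty values in a column, first() keeps whichever comes first in row order, so sort the frame beforehand if a particular row should win.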
Corralien

You need some sort of identifier that defines which rows count as the "same".

If all rows are identical and you just want one, you can use:

df.drop_duplicates()

or the answer of @Corralien.

If all rows that should be aggregated share a specific trait, e.g. 'id', you can use:

df.groupby('id').agg(set)

which returns one row per id, with a set of the unique values found in each column. A Python set is unordered, so if the order matters you can use the keys of a dictionary as a replacement; see "Does Python have an ordered set?"
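
The dictionary-key idea could be sketched roughly as follows. The helper name ordered_unique is made up for this illustration, the frame is rebuilt from the question, and the blank cells are assumed to be empty strings. The result is collected into a plain nested dict rather than a dataframe to keep the sketch simple.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame. Assumption: blank cells are empty strings.
df = pd.DataFrame({
    "id": ["2be858a0458faa569d3d"] * 3,
    "email": ["user_a@gmail.com"] * 3,
    "firstname": ["", "Firstname", "Firstname"],
    "lastname": ["Lastname", "", ""],
    "salutation": ["", "", "Mr"],
})

def ordered_unique(values):
    """Hypothetical helper: unique non-null values in first-seen order."""
    # dict keys keep insertion order, so this deduplicates without reordering.
    return list(dict.fromkeys(values.dropna()))

cleaned = df.replace({"": np.nan})

# One entry per id, holding the ordered unique values of every other column.
collected = {
    key: {col: ordered_unique(group[col]) for col in cleaned.columns if col != "id"}
    for key, group in cleaned.groupby("id")
}
print(collected)
```

Each column ends up as a list, so a column with conflicting values for the same id shows all of them instead of silently keeping one.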

Andreas