0

I am looking to remove duplicate rows based on the values in a column ("Name"), but append the corresponding string values in another column("Occupation").

Duplicate entry is "Jack"

I have a dataframe:

Name center Occupation
Jack Miami Clerk
Alice Tx Manager
Jack San Jose PO
Cathy Houston Security

And i am expecting this

Name center Occupation
Jack Miami Clerk PO
Alice Tx Manager
Cathy Houston Security

Appreciate any answer on this

  • so for "Occupation" you would want to join the values, what about for "center"? Do you just want to keep the first match per name? Or if not, what type of aggregation you'd want there? – sophocles Nov 18 '22 at 14:51

2 Answers2

1

Seems like you're looking for something along the lines of:

df.groupby('Name',as_index=False).agg(
    {'center':'first','Occupation':lambda x: ' '.join(x)}
    )

which get's you:

    Name   center Occupation
0  Alice       Tx    Manager
1  Cathy  Houston   Security
2   Jack    Miami   Clerk PO

However, I can't be sure what type of aggregation you'd want for 'center' - using first now which get's your desired outcome.

sophocles
  • 13,593
  • 3
  • 14
  • 33
  • Thanks for your answer, but i would like to keep the same order in the column 'Name' --Jack, Alice, cathy –  Nov 18 '22 at 15:14
  • Please then include ```df.groupby('Name',sort=False)``` in the ```groupby``` clause – sophocles Nov 18 '22 at 15:21
0

You can do a groupby, and then aggregate the columns as you like. For occupation you already wrote to join, for center I chose first.

out = df.groupby('Name', as_index=False, sort=False).agg({'center': 'first', 'Occupation': ' '.join})
print(out)

Output:

    Name   center Occupation
0   Jack    Miami   Clerk PO
1  Alice       Tx    Manager
2  Cathy  Houston   Security
Rabinzel
  • 7,757
  • 3
  • 10
  • 30
  • Thanks for your answer, but i would like to keep the same order in the column 'Name' --Jack, Alice, cathy –  Nov 18 '22 at 15:15