
I have a data frame (dfCust) like so:

|cust_key|first_name|last_name|address        |
-----------------------------------------------
|12345   |John      |Doe      |123 Some street|
|12345   |John      |Doe      |123 Some st    |
|67890   |Jane      |Doe      |456 Some street|

and I would like to remove duplicate records so that the cust_key field is unique. I do not care which record gets dropped: by the time this step runs, the addresses have already been deduplicated, so the only duplicates that trickle through are spelling variants. I would like the following resulting dataframe:

|cust_key|first_name|last_name|address        |
-----------------------------------------------
|12345   |John      |Doe      |123 Some street|
|67890   |Jane      |Doe      |456 Some street|

in R this would be done with data.table like this:

library(data.table)
dfCust <- unique(setDT(dfCust), by = "cust_key")

but I need a way to do this in pandas.

  • `df.drop_duplicates('cust_key')` for dropping duplicates based on a single col: `cust_key` – anky Jan 08 '20 at 16:51
  • perfect, thank you. I knew it was something small I was missing. If you put this into an answer I'll upvote and accept! – DBA108642 Jan 08 '20 at 16:52
  • That's okay, its a dupe: check this: https://stackoverflow.com/questions/50885093/how-do-i-remove-rows-with-duplicate-values-of-columns-in-pandas-data-frame – anky Jan 08 '20 at 16:54

1 Answer

df.drop_duplicates(subset='cust_key')
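
To make this concrete, here is a minimal runnable sketch using the sample data from the question (the dfCust construction just reproduces the table above; keep='first' is the pandas default and is spelled out only for clarity):

import pandas as pd

# rebuild the sample data frame from the question
dfCust = pd.DataFrame({
    'cust_key':   [12345, 12345, 67890],
    'first_name': ['John', 'John', 'Jane'],
    'last_name':  ['Doe', 'Doe', 'Doe'],
    'address':    ['123 Some street', '123 Some st', '456 Some street'],
})

# keep only the first row for each cust_key
dfCust = dfCust.drop_duplicates(subset='cust_key', keep='first')
print(dfCust)
#    cust_key first_name last_name          address
# 0     12345       John       Doe  123 Some street
# 2     67890       Jane       Doe  456 Some street

The keep argument controls which duplicate survives ('first', 'last', or False to drop every row with a duplicated key); since the question does not care which record is dropped, the default is fine. Chain .reset_index(drop=True) afterwards if you also want a clean 0-based index.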