
I have a data frame (dfCust) like so:

|cust_key|first_name|last_name|address        |
-----------------------------------------------
|12345   |John      |Doe      |123 Some street|
|12345   |John      |Doe      |123 Some st    |
|67890   |Jane      |Doe      |456 Some street|

and I would like to remove duplicate records so that the cust_key field is unique. I do not care which record gets dropped: by the time this step runs, the addresses have already been deduplicated, so the only duplicates that trickle through are spelling variants. I would like the following resulting dataframe:

|cust_key|first_name|last_name|address        |
-----------------------------------------------
|12345   |John      |Doe      |123 Some street|
|67890   |Jane      |Doe      |456 Some street|

in R this would be done with data.table like this:

library(data.table)
dfCust <- unique(setDT(dfCust), by = "cust_key")

but I need a way to do this in pandas.

  • `df.drop_duplicates('cust_key')` for dropping duplicates based on a single col: `cust_key` – anky Jan 08 '20 at 16:51
  • perfect, thank you. I knew it was something small I was missing. If you put this into an answer I'll upvote and accept! – DBA108642 Jan 08 '20 at 16:52
  • That's okay, its a dupe: check this: https://stackoverflow.com/questions/50885093/how-do-i-remove-rows-with-duplicate-values-of-columns-in-pandas-data-frame – anky Jan 08 '20 at 16:54

1 Answer

df.drop_duplicates(subset='cust_key')
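
To make this concrete, here is a minimal runnable sketch using the sample data from the question (the dfCust construction just reproduces the table above; keep='first' is the pandas default and is spelled out only for clarity):

import pandas as pd

# rebuild the sample data frame from the question
dfCust = pd.DataFrame({
    'cust_key':   [12345, 12345, 67890],
    'first_name': ['John', 'John', 'Jane'],
    'last_name':  ['Doe', 'Doe', 'Doe'],
    'address':    ['123 Some street', '123 Some st', '456 Some street'],
})

# keep only the first row for each cust_key
dfCust = dfCust.drop_duplicates(subset='cust_key', keep='first')
print(dfCust)
#    cust_key first_name last_name          address
# 0     12345       John       Doe  123 Some street
# 2     67890       Jane       Doe  456 Some street

The keep argument controls which duplicate survives ('first', 'last', or False to drop every row with a duplicated key); since the question does not care which record is dropped, the default is fine. Chain .reset_index(drop=True) afterwards if you also want a clean 0-based index.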