
As the title states, I am trying to get rid of duplicate rows in a df with two series/columns, df[['Offering Family', 'Major Offering']].

I hope to merge the resulting df with another one I have, based on the Major Offering column, so only the Offering Family column will be carried over to the new df. I should note that I only want to get rid of rows whose values are repeated in both columns. If a value appears more than once in the Offering Family column but the corresponding values in the Major Offering column differ, those rows should not be deleted. However, when I run the code below, I'm losing exactly those rows. Can anybody help?

df = pd.read_excel(pipelineEx, sheet_name='Data')

dfMO = df[['Offering Family', 'Major Offering']].copy()

dfMO.filter(['Offering Family', 'Major Offering'])

dfMO = df.drop_duplicates(subset=None, keep="first", inplace=False)


#dfMO.drop_duplicates(keep=False,inplace=True)
print(dfMO)

dfMO.to_excel("Major Offering.xlsx")
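To make the requirement concrete, here is some made-up data (the values are just for illustration, not my real file): only the row that repeats in both columns together should count as a duplicate.

```python
import pandas as pd

# Made-up data, just to illustrate the requirement: the second
# (Cloud, Storage) row repeats in BOTH columns and should be removed;
# (Cloud, Compute) repeats only in 'Offering Family' and must be kept.
df = pd.DataFrame({
    "Offering Family": ["Cloud", "Cloud", "Cloud", "Security"],
    "Major Offering": ["Storage", "Compute", "Storage", "Firewall"],
})

# Exactly one row is a duplicate across both columns together
print(df.duplicated().sum())  # 1
```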
Joe Ferndz
A.Bux
  • Can you share a few sample records so we can know what is wrong with the drop_duplicates – Joe Ferndz Feb 16 '21 at 20:18
  • [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) - we don't have access to your Excel file; your [mre] should include data representative of your real data. – wwii Feb 16 '21 at 20:58

2 Answers


Well, there are a few things that are odd in the code you've shared.

Primarily, you created dfMO as a copy of df with only the two columns, but then you applied the drop_duplicates() function to df, the original dataframe, overwriting the dfMO you created.

From what I understand, you need the dataframe to retain every unique combination of values from the two columns. groupby() would be well suited for that.

Try this:

cols = ['Offering Family', 'Major Offering']
dfMO = df[cols].groupby(cols).count().reset_index()

reset_index() will return a copy, by default, so no additional keyword arguments are necessary.
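For example, on some made-up data (the values here are assumptions for illustration, not the asker's real file), the groupby line collapses the pair that repeats in both columns while keeping pairs that differ in either column:

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "Offering Family": ["Cloud", "Cloud", "Cloud", "Security"],
    "Major Offering": ["Storage", "Compute", "Storage", "Firewall"],
})

cols = ['Offering Family', 'Major Offering']
# Grouping by both columns leaves no value columns to count, so the
# result is simply one row per unique (family, offering) pair.
dfMO = df[cols].groupby(cols).count().reset_index()
print(dfMO)  # three unique pairs; the repeated (Cloud, Storage) is collapsed
```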

Dharman
  • I liked your approach, although the solution might be a bit advanced for the user. – Agyey Arya Feb 16 '21 at 20:33
  • @AgyeyArya I suppose the single-line code can be split up for the user's benefit. But I still stand by the `groupby` approach. – Aditya Karan Chhabra Feb 16 '21 at 20:47
  • What about just using drop_duplicates(inplace=True) instead? I would like to know why you would prefer groupby().count().reset_index() over the simpler solution. Wouldn't it be more complex (time- and memory-wise)? – Agyey Arya Feb 16 '21 at 20:50

I have updated your code. As Aditya Chhabra mentioned, you were creating a copy but not using it.

df = pd.read_excel(pipelineEx, sheet_name='Data')

dfMO = df[['Offering Family', 'Major Offering']].copy()
dfMO.drop_duplicates(inplace=True)
print(dfMO)

dfMO.to_excel("Major Offering.xlsx")
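Once the duplicates are gone, the merge on 'Major Offering' that the question mentions could look like the sketch below. The data and the name `other_df` are assumptions standing in for your second dataframe:

```python
import pandas as pd

# dfMO: the de-duplicated lookup from above (hypothetical values)
dfMO = pd.DataFrame({
    "Offering Family": ["Cloud", "Cloud", "Security"],
    "Major Offering": ["Storage", "Compute", "Firewall"],
})

# other_df: assumed stand-in for the second dataframe to be enriched
other_df = pd.DataFrame({
    "Major Offering": ["Storage", "Firewall"],
    "Deal Size": [100, 200],
})

# Left-merge so each row of other_df picks up its Offering Family
merged = other_df.merge(dfMO, on="Major Offering", how="left")
print(merged)
```

Note that if one Major Offering still mapped to several Offering Families, the merge would duplicate rows, which is why de-duplicating first matters.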
Agyey Arya