
P.S. Not just repeats in "Fruit": I also want rows with a repeated "ref" dropped! Thanks, all!

Q1: Can pandas drop a row as a duplicate if any one of its cells is duplicated? (All the data comes from a CSV.)

df1 read from CSV:

Fruit    ref
Banana   123
Banana   456
Apple    789
Apple    456
apricot  123
Orange   999

I want:

Fruit    ref
Banana   123
Apple    789
Orange   999

i.e. if a row is duplicated on Fruit OR duplicated on ref, remove the whole row, no matter which column is repeated; just keep the first occurrence.
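For copy-paste, here is the df1 sample as a DataFrame (just a sketch; in my real code it comes from pd.read_csv):

import pandas as pd

# the df1 sample above, built by hand for copy-paste (normally read with pd.read_csv)
df1 = pd.DataFrame({
    'Fruit': ['Banana', 'Banana', 'Apple', 'Apple', 'apricot', 'Orange'],
    'ref':   [123, 456, 789, 456, 123, 999]
})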

Q2: Another question is similar, but with another CSV.

df2 read from CSV:

Fruit    ref
Banana   8778
Apple    7899
Mango    999
Grapes   8778
Pear     5465
Cherry   7445
Mango    5465

I want:

Fruit    ref
Pear     5465
Cherry   7445

Similarly, I want duplicated rows dropped no matter which column is repeated, keeping only the first occurrence:

Banana and Apple already appear in df1, so those rows are dropped.

999 is also repeated in df1 (Orange 999), so Mango 999 is dropped.

5465 is repeated within df2 itself, so only the first occurrence (Pear 5465) is kept.

Thus only Pear 5465 and Cherry 7445 remain.
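And df2 as a DataFrame for copy-paste (again just a sketch; really it is read from CSV):

# the df2 sample, built by hand (normally read with pd.read_csv)
df2 = pd.DataFrame({
    'Fruit': ['Banana', 'Apple', 'Mango', 'Grapes', 'Pear', 'Cherry', 'Mango'],
    'ref':   [8778, 7899, 999, 8778, 5465, 7445, 5465]
})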

Q3: I have many CSVs with similar data. How can I remove all rows that are repeated in COL(Fruit) or COL(ref)? Or must I concat/merge everything into one CSV first and then use drop_duplicates?
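Something like this is what I imagine for many files (just a rough sketch; the folder and file pattern are made up, and I am not sure it is the right way):

import glob
import pandas as pd

# read every CSV from a hypothetical folder and stack them into one DataFrame
frames = [pd.read_csv(path) for path in glob.glob('data/*.csv')]
all_rows = pd.concat(frames, ignore_index=True)

# keep only the first occurrence per Fruit, then the first occurrence per ref
result = all_rows.drop_duplicates(subset=['Fruit']).drop_duplicates(subset=['ref'])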

Sorry, my English is not good and it is difficult to explain what I want; I hope you understand what I mean.

antony yu
3 Answers

import pandas as pd

df = pd.DataFrame({
    'Fruit' : ['Banana', 'Banana', 'Apple', 'Apple', 'Apricot', 'Orange'],
    'ref' : [123, 456, 789, 456, 123, 999]
})

    Fruit   ref
0   Banana  123
1   Banana  456
2   Apple   789
3   Apple   456
4   Apricot 123
5   Orange  999

# keep rows whose Fruit value occurs exactly once
df1 = df.groupby(["Fruit"]).filter(lambda g: g.shape[0] == 1)
# keep rows whose ref value occurs exactly once
df2 = df.groupby(["ref"]).filter(lambda g: g.shape[0] == 1)
# combine both results and drop rows that appear in both
df = pd.concat([df1, df2]).drop_duplicates()

Output:

    Fruit   ref
4   Apricot 123
5   Orange  999
2   Apple   789

Also check this post to know more.


For the first question, chain the conditions for both columns:

df = df1[~df1['Fruit'].duplicated() & ~df1['ref'].duplicated()]
print (df)
    Fruit  ref
0  Banana  123
2   Apple  789
5  Orange  999

For the second question, first concatenate both DataFrames and remove duplicates the same way; to keep only values from df2, use an inner join with df2 (merge):

df12 = pd.concat([df1, df2])
df = df12[~df12['Fruit'].duplicated() & ~df12['ref'].duplicated()].merge(df2)
print (df)
    Fruit   ref
0    Pear  5465
1  Cherry  7445
jezrael

just a hack,

>>> import pandas as pd
>>> data = [['banana', 123], ['banana', 456], ['apple', 789], ['apricot', 123], ['orange', 999]]
>>> df = pd.DataFrame(data,columns=['fruit','ref'])
>>> df.drop_duplicates(subset=['fruit'])
     fruit  ref
0   banana  123
2    apple  789
3  apricot  123
4   orange  999
>>> (df.drop_duplicates(subset=['fruit'])).drop_duplicates(subset=['ref'])
    fruit  ref
0  banana  123
2   apple  789
4  orange  999

click here for snippet

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/32469454) – Nimantha Aug 12 '22 at 08:30
  • Hey, this is only for the image that I was trying to paste; since Stack Overflow won't allow me to paste the image directly, it generated a link. – archana_prasad Aug 12 '22 at 08:36
  • **NOTE**: As of Revision 3 (which added the code block from the image) the answer is **NOT** a link only answer – Nimantha Aug 12 '22 at 08:40
  • Thanks, but I can't make it useful for my case. – antony yu Aug 12 '22 at 09:01