
P.S. Not just repeats in "Fruit": I also want rows with a repeated "ref" dropped! Thanks, all!

Q1: Can pandas drop a row as a duplicate if any one of its cells is duplicated? (All the data comes from a CSV.)

df1 read from CSV:

Fruit    ref
Banana   123
Banana   456
Apple    789
Apple    456
apricot  123
Orange   999

I want:

Fruit    ref
Banana   123
Apple    789
Orange   999

i.e. if a row is duplicated on Fruit OR duplicated on ref, remove the whole row, no matter which column is repeated; just keep the first occurrence.
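For copy-paste, here is the df1 sample as a DataFrame (just a sketch; in my real code it comes from pd.read_csv):

import pandas as pd

# the df1 sample above, built by hand for copy-paste (normally read with pd.read_csv)
df1 = pd.DataFrame({
    'Fruit': ['Banana', 'Banana', 'Apple', 'Apple', 'apricot', 'Orange'],
    'ref':   [123, 456, 789, 456, 123, 999]
})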

Q2: Another question is similar, but with another CSV.

df2 read from CSV:

Fruit    ref
Banana   8778
Apple    7899
Mango    999
Grapes   8778
Pear     5465
Cherry   7445
Mango    5465

I want:

Fruit    ref
Pear     5465
Cherry   7445

Similarly, I want duplicated rows dropped no matter which column is repeated, keeping only the first occurrence:

Banana and Apple already appear in df1, so those rows are dropped.

999 is also repeated in df1 (Orange 999), so Mango 999 is dropped.

5465 is repeated within df2 itself, so only the first occurrence (Pear 5465) is kept.

Thus only Pear 5465 and Cherry 7445 remain.
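And df2 as a DataFrame for copy-paste (again just a sketch; really it is read from CSV):

# the df2 sample, built by hand (normally read with pd.read_csv)
df2 = pd.DataFrame({
    'Fruit': ['Banana', 'Apple', 'Mango', 'Grapes', 'Pear', 'Cherry', 'Mango'],
    'ref':   [8778, 7899, 999, 8778, 5465, 7445, 5465]
})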

Q3: I have many CSVs with similar data. How can I remove all rows that are repeated in COL(Fruit) or COL(ref)? Or must I concat/merge everything into one CSV first and then use drop_duplicates?
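Something like this is what I imagine for many files (just a rough sketch; the folder and file pattern are made up, and I am not sure it is the right way):

import glob
import pandas as pd

# read every CSV from a hypothetical folder and stack them into one DataFrame
frames = [pd.read_csv(path) for path in glob.glob('data/*.csv')]
all_rows = pd.concat(frames, ignore_index=True)

# keep only the first occurrence per Fruit, then the first occurrence per ref
result = all_rows.drop_duplicates(subset=['Fruit']).drop_duplicates(subset=['ref'])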

Sorry, my English is not good and it is difficult to explain what I want; I hope you understand what I mean.

antony yu
3 Answers

import pandas as pd

df = pd.DataFrame({
    'Fruit' : ['Banana', 'Banana', 'Apple', 'Apple', 'Apricot', 'Orange'],
    'ref' : [123, 456, 789, 456, 123, 999]
})

    Fruit   ref
0   Banana  123
1   Banana  456
2   Apple   789
3   Apple   456
4   Apricot 123
5   Orange  999

# keep rows whose Fruit value occurs exactly once
df1 = df.groupby(["Fruit"]).filter(lambda g: g.shape[0] == 1)
# keep rows whose ref value occurs exactly once
df2 = df.groupby(["ref"]).filter(lambda g: g.shape[0] == 1)
# combine both results and drop rows that appear in both
df = pd.concat([df1, df2]).drop_duplicates()

Output:

    Fruit   ref
4   Apricot 123
5   Orange  999
2   Apple   789

Also check this post to know more.


For the first question, chain the conditions for both columns:

df = df1[~df1['Fruit'].duplicated() & ~df1['ref'].duplicated()]
print (df)
    Fruit  ref
0  Banana  123
2   Apple  789
5  Orange  999

For the second question, first concatenate both DataFrames and remove duplicates the same way; to keep only values from df2, use an inner join with df2 (merge):

df12 = pd.concat([df1, df2])
df = df12[~df12['Fruit'].duplicated() & ~df12['ref'].duplicated()].merge(df2)
print (df)
    Fruit   ref
0    Pear  5465
1  Cherry  7445
jezrael

just a hack,

>>> import pandas as pd
>>> data = [['banana', 123], ['banana', 456], ['apple', 789], ['apricot', 123], ['orange', 999]]
>>> df = pd.DataFrame(data,columns=['fruit','ref'])
>>> df.drop_duplicates(subset=['fruit'])
     fruit  ref
0   banana  123
2    apple  789
3  apricot  123
4   orange  999
>>> (df.drop_duplicates(subset=['fruit'])).drop_duplicates(subset=['ref'])
    fruit  ref
0  banana  123
2   apple  789
4  orange  999

click here for snippet

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/32469454) – Nimantha Aug 12 '22 at 08:30
  • Hey, this is only for the image that I was trying to paste; since Stack Overflow won't allow me to paste the image directly, it generated a link. – archana_prasad Aug 12 '22 at 08:36
  • **NOTE**: As of Revision 3 (which added the code block from the image) the answer is **NOT** a link only answer – Nimantha Aug 12 '22 at 08:40
  • Thanks, but I can't make it useful for my case. – antony yu Aug 12 '22 at 09:01