
Sorry, this is my second post - please let me know if something doesn't make sense!

I'm trying to remove all rows that have any duplicates. I've tried the keep=False parameter for drop_duplicates(), and it's just not doing the right thing.

Let's say my DataFrame looks something like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 123      | YYY       |
| 123      | YYY       |
| 456      | XXX       |
| 456      | XXX       |
| 456      | XXX       |
| 789      | XXX       |
| 000      | YYY       |

I want it to look like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 789      | XXX       |
| 000      | YYY       |
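For reference, here is a runnable version of the sample above and the attempt described (a sketch; it assumes drop_duplicates(keep=False) was called on the whole frame, and real data with extra columns could behave differently):

```python
import pandas as pd

# Sample data from the question. Note: in Python, the literal 000 is just 0,
# so the last ORDER ID prints as 0.
df = pd.DataFrame(
    {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
     "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
)

# keep=False drops every row that has at least one exact duplicate,
# rather than keeping a first/last copy.
result = df.drop_duplicates(keep=False)
print(result)
```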
Velocibadgery
datanerd

4 Answers


Try using:

df = df.drop_duplicates(subset='ORDER ID')
Julio S.

Let's define your sample DataFrame:

data = {"ORDER ID":[123, 123, 123, 456, 456, 456, 789, 000], "ITEM CODE":['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'XXX', 'XXX', 'YYY']}

df = pd.DataFrame(data)

 ORDER ID ITEM CODE
  123       XXX
  123       YYY
  123       YYY
  456       XXX
  456       XXX
  456       XXX
  789       XXX
  000       YYY

You can remove duplicates based on selected columns or all columns; the subset parameter can also be a list of column names.

new_df = df.drop_duplicates(subset='ORDER ID')

 ORDER ID ITEM CODE
  123       XXX
  456       XXX
  789       XXX
  000       YYY
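As a sketch of the list form of subset mentioned above (on the same sample data), duplicates are then judged on the combination of all listed columns, keeping the first occurrence:

```python
import pandas as pd

data = {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
df = pd.DataFrame(data)

# With a list, a row is a duplicate only if BOTH columns match an earlier row,
# so (123, XXX) and (123, YYY) are kept as distinct rows.
new_df = df.drop_duplicates(subset=["ORDER ID", "ITEM CODE"])
print(new_df)
```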
yakkaya

I suggest using a loop to iterate through every row. While iterating, use an if statement to compare the current row to the last one: if it matches, exclude it; if it doesn't, keep the row.
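A rough sketch of that idea (note the caveats in the comments below: this only works if duplicates are adjacent, so the frame is sorted first, it keeps the first copy of each duplicate rather than removing all copies, and row-by-row iteration is slow on large frames):

```python
import pandas as pd

df = pd.DataFrame(
    {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
     "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
)

rows = []
last = None
# Sort first so identical rows sit next to each other, then compare
# each row to the previous one and skip exact repeats.
for row in df.sort_values(["ORDER ID", "ITEM CODE"]).itertuples(index=False):
    if row != last:
        rows.append(row)
    last = row

deduped = pd.DataFrame(rows, columns=df.columns)
print(deduped)
```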

  • The dataset is too large to iterate through each row, and the duplicates aren't necessarily bunched together, unless you have a piece of code I could try. – datanerd Nov 05 '21 at 20:00
  • Oh, so you want to just remove them entirely? Try adding the duplicates you want removed to a case list; it will then update the list. Also, if the input is too large, you can always section off the first couple of lines in the list and append them to a new file. –  Nov 05 '21 at 20:11

I managed to combine two other answers:

  1. Find the lines to drop. https://stackoverflow.com/a/64105947/2681662
  2. Use that DataFrame to drop them. https://stackoverflow.com/a/44706892/2681662

Find lines to drop:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

Then drop all lines listed in to_drop.

So the whole code would look like:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

print(pd.merge(df,to_drop, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
MSH
  • This looks like it would work! I'll let you know if it does - thanks so much! – datanerd Nov 05 '21 at 20:49
  • Hey, so I ran into an error that says "cannot reindex from a duplicate axis". Also, I should let you know that there are other columns in my table, so I ran duplicated() with a subset parameter. – datanerd Nov 08 '21 at 16:03