
Sorry, this is my second post - please let me know if something doesn't make sense!

I'm trying to remove all rows that have any duplicates. I've tried the keep=False parameter for drop_duplicates(), and it's just not doing the right thing.

Let's say my DataFrame looks something like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 123      | YYY       |
| 123      | YYY       |
| 456      | XXX       |
| 456      | XXX       |
| 456      | XXX       |
| 789      | XXX       |
| 000      | YYY       |

I want it to look like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 789      | XXX       |
| 000      | YYY       |
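For reference, here is a runnable version of the sample above and the attempt described (a sketch; it assumes drop_duplicates(keep=False) was called on the whole frame, and real data with extra columns could behave differently):

```python
import pandas as pd

# Sample data from the question. Note: in Python, the literal 000 is just 0,
# so the last ORDER ID prints as 0.
df = pd.DataFrame(
    {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
     "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
)

# keep=False drops every row that has at least one exact duplicate,
# rather than keeping a first/last copy.
result = df.drop_duplicates(keep=False)
print(result)
```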
Velocibadgery
datanerd

4 Answers


Try using:

df = df.drop_duplicates(subset='ORDER ID')
Julio S.

Let's define your sample DataFrame:

data = {"ORDER ID":[123, 123, 123, 456, 456, 456, 789, 000], "ITEM CODE":['XXX', 'YYY', 'YYY', 'XXX', 'XXX', 'XXX', 'XXX', 'YYY']}

df = pd.DataFrame(data)

 ORDER ID ITEM CODE
  123       XXX
  123       YYY
  123       YYY
  456       XXX
  456       XXX
  456       XXX
  789       XXX
  000       YYY

You can remove duplicates based on selected columns or all columns; the subset parameter can also be a list of column names.

new_df = df.drop_duplicates(subset='ORDER ID')

 ORDER ID ITEM CODE
  123       XXX
  456       XXX
  789       XXX
  000       YYY
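As a sketch of the list form of subset mentioned above (on the same sample data), duplicates are then judged on the combination of all listed columns, keeping the first occurrence:

```python
import pandas as pd

data = {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
        "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
df = pd.DataFrame(data)

# With a list, a row is a duplicate only if BOTH columns match an earlier row,
# so (123, XXX) and (123, YYY) are kept as distinct rows.
new_df = df.drop_duplicates(subset=["ORDER ID", "ITEM CODE"])
print(new_df)
```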
yakkaya

I suggest using a loop to iterate through every row. While iterating, use an if statement to compare the current row to the last one: if it matches, exclude it; if it doesn't, keep the row.
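A rough sketch of that idea (note the caveats in the comments below: this only works if duplicates are adjacent, so the frame is sorted first, it keeps the first copy of each duplicate rather than removing all copies, and row-by-row iteration is slow on large frames):

```python
import pandas as pd

df = pd.DataFrame(
    {"ORDER ID": [123, 123, 123, 456, 456, 456, 789, 0],
     "ITEM CODE": ["XXX", "YYY", "YYY", "XXX", "XXX", "XXX", "XXX", "YYY"]}
)

rows = []
last = None
# Sort first so identical rows sit next to each other, then compare
# each row to the previous one and skip exact repeats.
for row in df.sort_values(["ORDER ID", "ITEM CODE"]).itertuples(index=False):
    if row != last:
        rows.append(row)
    last = row

deduped = pd.DataFrame(rows, columns=df.columns)
print(deduped)
```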

  • The dataset is too large to iterate through each row, and the duplicates aren't necessarily bunched together, unless you have a piece of code I could try. – datanerd Nov 05 '21 at 20:00
  • Oh, so you want to just remove them entirely? Try adding the duplicates you want removed to a case list; it will then update the list. Also, if the input is too large, you can always section off the first couple of lines in the list and append them to a new file. –  Nov 05 '21 at 20:11

I managed to combine two other answers:

  1. Find the lines to drop. https://stackoverflow.com/a/64105947/2681662
  2. Use that DataFrame to drop them. https://stackoverflow.com/a/44706892/2681662

Find lines to drop:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

Then drop all lines listed in to_drop.

So the whole code would look like:

import pandas as pd

lst = [
    [123, "XXX"],
    [123, "YYY"],
    [123, "YYY"],
    [456, "XXX"],
    [456, "XXX"],
    [456, "XXX"],
    [789, "XXX"],
    [000, "YYY"],
]

df = pd.DataFrame(lst, columns=["ORDER ID", "ITEM CODE"])

to_drop = df[pd.DataFrame(df.sort_values(by=["ORDER ID", "ITEM CODE"]), index=df.index).duplicated()]

print(pd.merge(df,to_drop, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
MSH
  • This looks like it would work! I'll let you know if it does - thanks so much! – datanerd Nov 05 '21 at 20:49
  • Hey, so I ran into an error that says "cannot reindex from a duplicate axis". Also, I should let you know that there are other columns in my table, so I ran duplicated() with a subset parameter. – datanerd Nov 08 '21 at 16:03