
I'm trying to remove all rows that have any duplicates; I ONLY want the unique rows. I've tried the `keep=False` parameter for `drop_duplicates()` with `subset=['ORDER ID', 'ITEM CODE']`, and it's just not doing the right thing.
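
For reference, the call reconstructed from the description above (hypothetical, since the exact line isn't shown):

df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False)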

Let's say my dataframe looks something like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 123      | YYY       |
| 123      | YYY       |
| 456      | XXX       |
| 456      | XXX       |
| 456      | XXX       |
| 789      | XXX       |
| 000      | YYY       |

I want it to look like this:

| ORDER ID | ITEM CODE |
|----------|-----------|
| 123      | XXX       |
| 789      | XXX       |
| 000      | YYY       |

As you can see, the subset would be both the ORDER ID and ITEM CODE columns, and ideally we would lose rows 2-6. (The actual dataset has a lot more columns.)

  • _it's just not doing the right thing_. Okay, but what is it doing? – OneCricketeer Nov 08 '21 at 16:18
  • Do you get the data from a database? You could use `SELECT DISTINCT` in your database query then. – CodeIsLaw Nov 08 '21 at 16:18
  • 1
    By the way, `(123, XXX)` and `(123, YYY)` are unique rows, so your expected output seems incorrect. And where did `456` order go? – OneCricketeer Nov 08 '21 at 16:19
  • Unfortunately, the data is originally from a DB, but this file I'm working on has been heavily manipulated, so the fix would have to be in Python – datanerd Nov 08 '21 at 16:20
  • `123,YYY` is not unique, since you can see it has a duplicate, so it would not be required in the expected output – datanerd Nov 08 '21 at 16:21
  • 2
    works absolutely fine for me. What exactly is your `.drop_duplicates()` line? What version of pandas you on? – chitown88 Nov 08 '21 at 16:24
  • Why do you ask [the same question as you posted yesterday](https://stackoverflow.com/questions/69858242/removing-all-non-unique-rows-from-a-dataframe)? – wovano Nov 08 '21 at 16:57

3 Answers


Not sure what your issue is. It works fine for me:

import pandas as pd

# Sample data from the question, plus an extra column to mimic the wider dataset
data = [
    [123, 'XXX', 11],
    [123, 'YYY', 22],
    [123, 'YYY', 33],
    [456, 'XXX', 44],
    [456, 'XXX', 55],
    [456, 'XXX', 66],
    [789, 'XXX', 77],
    [0, 'YYY', 88],
]
columns = ['ORDER ID', 'ITEM CODE', 'extra column']
df = pd.DataFrame(data, columns=columns)

# keep=False drops every row that has a duplicate within the subset columns
df = df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False)

Output:

Before

print(df)
   ORDER ID ITEM CODE  extra column
0       123       XXX            11
1       123       YYY            22
2       123       YYY            33
3       456       XXX            44
4       456       XXX            55
5       456       XXX            66
6       789       XXX            77
7         0       YYY            88

After

print(df)
   ORDER ID ITEM CODE  extra column
0       123       XXX            11
6       789       XXX            77
7         0       YYY            88
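
Equivalently, you can build the same filter with a boolean mask via `DataFrame.duplicated`, which marks every member of a duplicate group when `keep=False`:

# True for every row whose (ORDER ID, ITEM CODE) pair occurs more than once
mask = df.duplicated(subset=['ORDER ID', 'ITEM CODE'], keep=False)
df = df[~mask]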
chitown88

Try modifying your subset to include ORDER ID only:

df.drop_duplicates(subset=['ORDER ID'])
Yaakov Bressler
  • this wouldn't work because it would leave me with the first duplicated value; `keep='first'` is the default – datanerd Nov 08 '21 at 16:25
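
For reference, a quick sketch with the sample data from the first answer, showing the difference the comment describes:

df.drop_duplicates(subset=['ORDER ID'])              # keep='first' (default): one row per ORDER ID survives
df.drop_duplicates(subset=['ORDER ID'], keep=False)  # every duplicated ORDER ID is removed entirely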

It's very likely that you are simply not assigning the result back to the dataframe. You might be doing

df.drop_duplicates()

But `drop_duplicates` returns a new dataframe rather than modifying in place, so this fails to overwrite your previous values. Rather, you should be doing

df = df.drop_duplicates()
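
Alternatively, `drop_duplicates` accepts `inplace=True` if you'd rather mutate the existing dataframe than rebind the name:

df.drop_duplicates(subset=['ORDER ID', 'ITEM CODE'], keep=False, inplace=True)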

If you can't get `drop_duplicates` to work, you can use `numpy.unique` as a workaround. Note that applying it to each column separately won't work: it returns only the unique values, which no longer line up with the original rows. Instead, count how often each (ORDER ID, ITEM CODE) pair occurs and keep the rows whose pair occurs exactly once (a sketch; the `'|'` separator is an arbitrary choice for building a composite key):

import numpy as np

keys = (df['ORDER ID'].astype(str) + '|' + df['ITEM CODE'].astype(str)).to_numpy()
_, inv, counts = np.unique(keys, return_inverse=True, return_counts=True)  # occurrences per key
df = df[counts[inv] == 1]  # keep rows whose key occurs exactly once
morhc