
Is there a way to check for, and then drop, rows that are not unique?

My data frame looks something like this:

       ID1  ID2  weight
    0    2    4     0.5
    1    3    7     0.8
    2    4    2     0.5
    3    7    3     0.8
    4    8    2     0.5
    5    3    8     0.5

EDIT: I added a couple more rows to show that other unique rows that may have the same weight should be kept.

I think that when I use pandas `drop_duplicates(subset=['ID1', 'ID2', 'weight'], keep=False)` it considers each row individually and does not recognise that rows 0 and 2, and rows 1 and 3, are in fact the same pair with the same weight.
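For reference, the behaviour can be reproduced on the frame above: a plain `drop_duplicates` call returns all six rows, because no two rows are identical column-for-column:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [2, 3, 4, 7, 8, 3],
                   'ID2': [4, 7, 2, 3, 2, 8],
                   'weight': [0.5, 0.8, 0.5, 0.8, 0.5, 0.5]})

# keep=False drops every member of a duplicated group, but no row here
# is an exact duplicate of another, so nothing is dropped
out = df.drop_duplicates(subset=['ID1', 'ID2', 'weight'], keep=False)
print(len(out))  # 6 -- the swapped ID pairs are not recognised
```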

msa
  • The values in ID1 and ID2 are different, are you saying that the values in ID1 and ID2 are swappable? – TYZ Sep 28 '20 at 16:06
  • you just set your subset correctly `drop_duplicates(subset=['weight'])` – YOLO Sep 28 '20 at 16:07
  • @TYZ yes so the IDs may be different but the pair is the same if that makes sense? It does not matter if 2,4 is = 0.5 or 4,2 is = 0.5? – msa Sep 28 '20 at 16:08
  • @YOLO thank you for your suggestion - would this mean that other values that are also of weight 0.5 but have IDs 9,8, for example, would be dropped? – msa Sep 28 '20 at 16:09
  • 1
    @YOLO It will work for this particular data, but not gonna work in general if there's a row like `3, 4, 0.5`. – TYZ Sep 28 '20 at 16:09
  • @TYZ yes thank you this is what I meant! – msa Sep 28 '20 at 16:10

2 Answers

Sort the dataframe along `axis=1`, then use `np.unique` with the optional parameter `return_index=True` to get the indices of the unique rows:

import numpy as np

sub = ['ID1', 'ID2', 'weight']

# sort each row so swapped ID pairs become identical, then keep the
# first occurrence of each unique row
idx = np.unique(np.sort(df[sub], 1), axis=0, return_index=True)[1]
df1 = df.iloc[sorted(idx)]

Alternative approach suggested by @anky:

df1 = df[~pd.DataFrame(np.sort(df[sub], 1), index=df.index).duplicated()]

print(df1)

   ID1  ID2  weight
0    2    4     0.5
1    3    7     0.8
4    8    2     0.5
5    3    8     0.5
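Both variants can be checked end to end; a self-contained sketch that reconstructs the question's sample frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID1': [2, 3, 4, 7, 8, 3],
                   'ID2': [4, 7, 2, 3, 2, 8],
                   'weight': [0.5, 0.8, 0.5, 0.8, 0.5, 0.5]})
sub = ['ID1', 'ID2', 'weight']

# variant 1: np.unique on the row-sorted values
idx = np.unique(np.sort(df[sub], 1), axis=0, return_index=True)[1]
df1 = df.iloc[sorted(idx)]

# variant 2: pandas duplicated() on the row-sorted values
df2 = df[~pd.DataFrame(np.sort(df[sub], 1), index=df.index).duplicated()]

print(df1.equals(df2))  # True -- both keep rows 0, 1, 4, 5
```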
Shubham Sharma

This works, but it's kind of hacky: create sets from the columns that should be treated as pairs, and convert them to tuples to get hashable types.

# an unordered pair as a hashable key; rows 0/2 and 1/3 now share keys
df['new'] = df[['ID1','ID2']].apply(lambda x: tuple(set(x)), axis=1)
df.drop_duplicates(subset=['new','weight'], keep=False)

Out:

   ID1  ID2  weight     new
4    8    2     0.5  (8, 2)
5    3    8     0.5  (8, 3)
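A small variant of the same idea (my own sketch, not from the answer) uses `tuple(sorted(...))` instead of `tuple(set(...))`, which gives a canonical key without depending on set iteration order and keeps both elements when ID1 equals ID2:

```python
import pandas as pd

df = pd.DataFrame({'ID1': [2, 3, 4, 7, 8, 3],
                   'ID2': [4, 7, 2, 3, 2, 8],
                   'weight': [0.5, 0.8, 0.5, 0.8, 0.5, 0.5]})

# a sorted tuple is a canonical, hashable key for an unordered pair
df['new'] = df[['ID1', 'ID2']].apply(lambda x: tuple(sorted(x)), axis=1)
out = df.drop_duplicates(subset=['new', 'weight'], keep=False).drop(columns='new')
print(out)  # rows 4 and 5 remain, as in the answer's output
```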
Michael Szczesny