2

Consider the dataframe df

   A  B  C   D  match?
0  x  y  1   1  true
1  x  y  1   2  false
2  x  y  2   1  false
3  x  y  2   2  true
4  x  y  3   4  false
5  x  y  5   6  false

I would like to drop the unmatched rows that are already matched somewhere else.

   A  B  C  D  match?
1  x  y  1  1  true
3  x  y  2  2  true
4  x  y  3  4  false
5  x  y  5  6  false

How can I do that with Pandas?

Nickil Maveli
  • 29,155
  • 8
  • 82
  • 85
fast_cen
  • 1,297
  • 3
  • 11
  • 28

3 Answers3

3

You could sort those two columns so that their order of positioning could be made same throughout. Then, drop off all such duplicated entries present by providing keep=False in DF.drop_duplicates() method.

df[['C','D']] = np.sort(df[['C','D']].values)
df.drop_duplicates(keep=False)

enter image description here

Nickil Maveli
  • 29,155
  • 8
  • 82
  • 85
  • This seems do to the trick, even though, you have to be careful because "C" and "D" value can be swap (if the D is greater than C, not the case here) – fast_cen Jan 20 '17 at 15:06
  • Yeah, that's why I had to sort them before so that they're uniform throughout. – Nickil Maveli Jan 20 '17 at 15:10
2

you can compare the two columns with

df.C == df.D

0     True
1    False
2    False
3     True
4    False
dtype: bool

Then shift the series down.

0      NaN
1     True
2    False
3    False
4     True
dtype: object

Each True value indicates the start of a new group. We can use cumsum to create the groupings we need for groupby

(df.C == df.D).shift().fillna(False).cumsum()

0    0
1    1
2    1
3    1
4    2
dtype: int64

Then use groupy + last

df.groupby(df.C.eq(df.D).shift().fillna(False).cumsum()).last()

   A  B  C  D
0  x  y  1  1
1  x  y  2  2
2  x  y  3  4
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

If you would like to remove the rows where "C" and "D" matched, the method .ix will help you:

df = df.ix[(df['C'] != df['D'])]

Therefore, df['C'] != df['D'] generates a list of booleans and .ix allows you to extract the corresponding DataFrame :)