5

I'm struggling to figure out how to remove rows from a pandas dataframe in which two specified columns have the same value across a row.

For example, in the below examples I would like to remove the rows which have duplicate values in the columns 2 and 4.

For example:

Column1 Column2 Column3 Column4
  Pat     123     John    456
  Pat     123     John    345 
  Jimmy   678     Mary    678 
  Larry   678     James   983

Would turn into:

Column1 Column2 Column3 Column4
  Pat     123     John    456 
  Pat     123     John    345
  Larry   678     James   983

Any help is appreciated, thank you!

cs95
  • 379,657
  • 97
  • 704
  • 746
  • Possible duplicate of [This](https://stackoverflow.com/questions/43951558/remove-rows-that-two-columns-have-the-same-values-by-pandas) – Karn Kumar Jan 02 '19 at 16:48
  • @pygo More variety of answers here, have marked that as dupe of this instead. – cs95 Jan 02 '19 at 17:23
  • @coldspeed, How could that be marked as duplicate as that being the old post , However, this is necent one to that, i'm afraid about the correct procedure of doing duplicate. – Karn Kumar Jan 02 '19 at 17:28
  • @pygo I usually refer people to [this comment](https://stackoverflow.com/questions/53645882/pandas-merging-101#comment94216191_53645882) I made on another post of mine. TLDR; closure is based on post quality, not age. – cs95 Jan 02 '19 at 17:29
  • Okay, it make sense, thx. – Karn Kumar Jan 02 '19 at 17:34

2 Answers2

9

Series.ne (!=)

df[df['Column2'] != df['Column4']]

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

Or, using operator.ne:

df[operator.ne(df['Column2'], df['Column4'])]

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

Compare the two; get a mask, then filter.

With loc, we can also supply a callback (suggested by @W-B!).

df.loc[lambda x : x['Column2'] != x['Column4']]

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

query

df.query('Column2 != Column4')

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

np.vectorize

import operator
f = pd.np.vectorize(lambda x, y: x != y)
df[f(df['Column2'], df['Column4'])]

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

...Just for fun.


List Comprehension

df[[x != y for x, y in zip(df['Column2'], df['Column4'])]]

  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

Faster than you think!

Community
  • 1
  • 1
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 2
    `df.loc[lambda x : x['Column2']!=x['Column4']] ` – BENY Jan 02 '19 at 16:32
  • 1
    @W-B Nice one, forgot loc supports callbacks! – cs95 Jan 02 '19 at 16:34
  • Does `pd.np.vectorize(operator.ne)` have any benefit over just `operator.ne`? In fact, won't the former work row-wise, which is probably not what you want? – jpp Jan 02 '19 at 16:37
  • @coldspeed Thanks for the great answer! In my dataframe 'column2' is of type int64, while 'column4' is an object. For some reason your answer isn't removing duplicate values between these two columns. –  Jan 02 '19 at 16:38
  • 1
    @jpp You're right, it does not. I moved the answers around a bit and replaced `ne` with a `lambda` (...just for fun!). – cs95 Jan 02 '19 at 16:39
  • 1
    @sg123 If `column4` is an object, try this: `df['column4'] = pd.to_numeric(d['column4'], errors='coerce')` See [here](https://stackoverflow.com/a/47942854/4909087) for more information. – cs95 Jan 02 '19 at 16:40
  • @sg123, see my updated answer where `df[df.Column2.ne(df.Column4)]` looks to be simplest one. – Karn Kumar Jan 02 '19 at 18:46
  • @pygo ne is just the operator equivalent of !=, Same as this. Actually in terms of simplicity I would argue for query. – cs95 Jan 02 '19 at 18:53
  • @coldspeed, You are correct its just a wrapper over Arithmetic operations API ,However, query method under the hood passed to `DataFrame.loc` and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to `DataFrame.__getitem__()`. BUt Good to see multiple way of doing it. – Karn Kumar Jan 03 '19 at 03:53
  • How would you do this if you wanted to pass a list of ff columns instead of explicitly calling out df['Column2'], df['Column4']? I have a use case where I need to compare where about 9 column rows match and then drop them if they do. – Joe Rivera Mar 14 '20 at 00:10
  • @JoeRivera you could change `zip(df['Column2'], df['Column4'])` to `df[list_of_cols].values.tolist()`. – cs95 Mar 14 '20 at 01:37
  • @cs95 I received this error ValueError: too many values to unpack (expected 2) `test_df = final_df[[x != y for x, y in final_df[is_updated_cols].values.tolist()]]` – Joe Rivera Mar 14 '20 at 01:58
  • @JoeRivera it's too much to explain in a comment but essentially you will need the same number of variables as columns you are slicing on. I'm not sure what you're trying to do so I can't advise further. I suggest to open a question. – cs95 Mar 14 '20 at 04:13
  • @cs95 here is a solution provided by @jezreal `L = ['S','T'] df = df[df[L].ne(df[L[0]], axis=0).any(axis=1)]` in case anyone else is looking to pass a list of column names to match. – Joe Rivera Mar 14 '20 at 14:49
3

Just another way around:

Solution with masking the matched values..

>>> mask = (df.Column2 == df.Column4)
>>> df[~mask]
  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

Or simply using Boolean indexing ...

>>> df[ df.Column2 != df.Column4 ]
  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

another one with DataFrame.drop method:

>>> df.drop(df[(df.Column2) == (df.Column4)].index)
  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983

In addition to above one, if you want to make changes directly into the dataFrame, use inplace=True:

df.drop(df[(df.Column2) == (df.Column4)].index, inplace=True)

Another Nice solution with pandas.DataFrame.ne Wrapper for flexible comparison methods ne.

>>> df[df.Column2.ne(df.Column4)]
  Column1  Column2 Column3  Column4
0     Pat      123    John      456
1     Pat      123    John      345
3   Larry      678   James      983
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53