set difference for pandas

Question

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
Out[7]: 
   col1  col2
0     1     2
1     2     3
2     3     4

In [8]: df2
Out[8]: 
   col1  col2
0     4     6
1     2     3
2     5     5

so maybe something like df2.set_diff(df1) will produce this:

   col1  col2
0     4     6
2     5     5

However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

yes there is drop_duplicates method... CHeck documentation http://pandas.pydata.org. Depending on how data is structured you should be able to do set operations too. can definately replace items in frame with items from another frame. Not sure offhand if it is possible to check for uniqueness across all the columns — Joop, Aug 12 '13 at 06:39
Can you provide a sample data and tell us what do you try to accomplish? — Viktor Kerkez, Aug 12 '13 at 06:46
if there are non duplicates... which dataframe has the accurate ones. Ie do you merely want to find the unique items or do you need to merge them with some additional logic? — Joop, Aug 12 '13 at 06:57
@Joop If there are no duplicates, set difference returns the first dataframe (e.g. df2.set_diff(df1) returns df2) — r_31415, Aug 12 '13 at 07:01

score 70 · Answer 1 · edited Oct 28 '21 at 22:45

70

Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:

ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))

This step will get rid of any duplicates in the dataframes as well (index ignored)

set([(1, 2), (3, 4), (2, 3)])   # ds1

can then use set methods to find anything. Eg to find differences:

ds1.difference(ds2)

gives: set([(1, 2), (3, 4)])

can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:

pd.DataFrame(list(ds1.difference(ds2)))

edited Oct 28 '21 at 22:45

assembly_wizard

2,034
1
17
10

answered Aug 12 '13 at 10:43

Joop

7,840
9
43
58

Fantastic. This worked perfectly. Obviously, it reorders the rows but that's not a problem. Maybe the only problem associated with this is that df1 needs to be converted to a set first, so that will remove duplicates in df1 which might not be desired. Thanks a lot! – r_31415 Aug 12 '13 at 16:48
yes set per definition creates an unordered data structure. Not sure if there is a pure pandas alternative, but seeing that the index data is not important to you it makes sense to use the built in methods. Be sure to check out @Jeff solution. He is a real pandas expert and might be quicker than my code. Also staying in pandas data structure will preserve the index data, which might still be of use to you. – Joop Aug 12 '13 at 20:33
1

Unfortunately, I found a small bug. If you're dealing with numbers that require a reasonable precision, this will cause some issues because it rounds to the nearest decimal. In my case, it's the 7th decimal that gets rounded. – r_31415 Aug 14 '13 at 03:13
What is the computational complexity of this? This seems unacceptably slow, no? – Scott Skiles Apr 02 '19 at 13:29
This is extremely slow. Something that uses compiled C (ie numpy or pandas) would be way better – CodingTil Nov 24 '22 at 20:08

radream · Answer 2 · 2021-03-08T08:15:09.967

59

Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)

pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

It is fast and the result is

   col1  col2
0     4     6
2     5     5

edited Mar 08 '21 at 08:15

answered Oct 24 '16 at 02:11

radream

781
6
10

2

That's clever. Of course, it doesn't have the flexibility of a set theoretic function. – r_31415 Oct 26 '16 at 18:34
FYI link is broken – Scott Skiles Apr 02 '19 at 13:28
2

That answer implies that there is any duplicated rows in df2 (df2 is already at set of rows) – guilloptero Oct 25 '20 at 22:02
1

@guilloptero: you could modify by dropping any duplicates in df2 before running – Bryan P Jan 13 '21 at 19:53
1

@BryanP's suggestion works: `pd.concat([df2.drop_duplicates(), df1, df1]).drop_duplicates(keep=False)` – Ozgur Ozturk Nov 22 '22 at 17:09

score 36 · Accepted Answer · edited May 15 '20 at 13:57

36

from pandas import  DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})


print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))

edited May 15 '20 at 13:57

user

5,335
7
47
63

answered Dec 20 '14 at 10:15

Shahriar

13,460
8
78
95

8

I believe this solution only works if this indexes are identical on both DataFrames. @Joop has an answer that works without this condition. – Alex Petralia May 27 '16 at 16:59
5

I wish this answer was explained a little more for those of us trying to follow along. I think that all three answers are the same, but they give different data types in the results. I think the ~ negates the operation, but what is the purpose of .all(1)? – David May 10 '17 at 23:40
1

@David, .all(axis) - axis : {0, 1}. 0 for row-wise, 1 for column-wise.So to check if all columns (axis=1) are null.. .all(1) is used. check result without .all(1) – Shahriar May 13 '17 at 12:06
The first does not work if you have Nones in your data. Example: `df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,None,4]})` `df2 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,None,4]})` `print(df2[~df2.isin(df1).all(1)])` – Jeff Mar 16 '20 at 17:47

score 5 · Answer 4 · answered Aug 10 '17 at 18:10

There are 3 methods which work, but two of them have some flaws.

Method 1 (Hash method):

It worked for all cases I tested.

df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]

Method 2 (Dict method):

It fails if DataFrames contain datetime columns.

df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]

Method 3 (MultiIndex method):

I encountered cases when it failed on columns with None's or NaN's.

df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)

Jeff · Answer 5 · 2013-08-14T15:33:20.170

4

Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)

In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0     4     6
2     5     5

Same thing, but include all values in df1, but still per column in df2

In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]: 
   col1  col2
0   NaN     6
2     5     5

2nd example

In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})

In [35]: g.columns=df1.columns

In [36]: g
Out[36]: 
   col1  col2
0   1.2     4
1   1.5     4
2   1.3     4

In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0   1.2   NaN
1   1.5   NaN
2   1.3   NaN

Note, in 0.13, there will be an isin operator on the frame level, so something like: df2.isin(df1) should be possible

edited Aug 14 '13 at 15:33

answered Aug 12 '13 at 13:00

Jeff

125,376
21
220
187

Interesting. For some reason, I couldn't make it work. It returned a lot of NaN's, so I need to see what went wrong. – r_31415 Aug 12 '13 at 16:57
I think there is a problem. If you have something like this: `g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})` and you want to remove `g.ix[[0,2]]`, then this won't select the correct rows because it checks if every element of g is in g.ix[[0,2]]. The second row includes a number (4) that is equal to one of g.ix[[0,2]] (actually, two of them are equal). – r_31415 Aug 14 '13 at 03:08
I am not clear on your comment; what would you expected as the result of ``g.set_diff(df1)`` (or the reverse) – Jeff Aug 14 '13 at 13:20
1

g.set_diff(df1) should return the rows in g that are not included in df1. – r_31415 Aug 14 '13 at 15:29

Ian Kent · Answer 6 · 2019-11-07T14:30:03.350

Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0 which greatly simplifies the syntax of this answer

df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Original Answer

Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference() method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.

df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

score 4 · Answer 7 · answered Jan 22 '20 at 10:21

Numpy's setdiff1d would work and perhaps be faster.

For each column: np.setdiff1(df1.col1.values, df2.col1.values)

So something like:

setdf = pd.DataFrame({
    col: np.setdiff1d(getattr(df1, col).values, getattr(df2, col).values)
    for col in df1.columns
})

numpy.setdiff1d docs

score 3 · Answer 8 · answered Aug 20 '16 at 18:23

Get the indices of the intersection with a merge, then drop them:

>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
>>> df_completed = df_all.iloc[::2]; df_completed
   A  B
0  0  1
2  4  5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
   index  A  B
0      0  0  1
1      2  4  5
>>> df_pending = df_all.drop(merged['index']); df_pending
   A  B
1  2  3
3  6  7

Jacek Pliszka · Answer 9 · 2017-11-17T23:40:20.140

2

Assumption:

df1 and df2 have identical columns

it is a set operation so duplicates are ignored

sets are not extremely large so you do not worry about memory

union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]

edited Nov 17 '17 at 23:40

answered Nov 17 '17 at 23:32

Jacek Pliszka

21
3

score 1 · Answer 10 · answered Aug 07 '17 at 20:49

I'm not sure how pd.concat() implicitly joins overlapping columns but I had to do a little tweak on @radream's answer.

Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):

df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)

This yields:

   col1  col2
0     1     2
2     3     4
3     4     6
4     5     5

score 1 · Answer 11 · answered Sep 20 '21 at 08:47

In Pandas 1.1.0 you can count unique rows with value_counts and find difference between counts:

df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

diff = df2.value_counts().sub(df1.value_counts(), fill_value=0)

Result:

col1  col2
1     2      -1.0
2     3       0.0
3     4      -1.0
4     6       1.0
5     5       1.0
dtype: float64

Get positive counts:

diff[diff > 0].reset_index(name='counts')


   col1  col2  counts
0     4     6     1.0
1     5     5     1.0

score 0 · Answer 12 · answered Mar 07 '23 at 19:41

The easiest way I know of is to use pd.merge with how='outer' and indicator=True

df3 = pd.merge(df1,df2,how='outer',left_on=['col1'],right_on=['col1'],indicator=True)

The resulting table will contain an new column _merge with the values right_only, left_only and both that you can then filter out. I.E.

df3[df3['_merge']=='left_only']

The upside to this approach is flexibility in that you can use ordered lists of columns for your definition of equality across the two tables. I.E. left_on=['col1','col2'],right_on=['col1','col3'].

Or if you meant something different by set difference you can use df3[~(df3['_merge']=='both')] or df3[~(df3['_merge']=='right_only')] or what ever matches your needs.

The downside is that you are doing a full outer join on the two tables and you may end up with a messy table requiring extra cleanup.

score -1 · Answer 13 · answered Jun 26 '19 at 10:57

this should work even if you have multiple columns in both dataframes. But make sure that the column names of both the dataframes are the exact same.

set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

With multiple columns you can also use:

col_names=['col_1','col_2']
set_difference = pd.concat([df2[col_names], df1[col_names], 
df1[col_names]]).drop_duplicates(keep=False)

set difference for pandas

13 Answers13

Linked

Related