4

Say I have two dataframes

df1
df2

that I can join on df1_keys and df2_keys.

I would like to do:

  1. (A-B)
  2. (A-B) U (B-A)

with A=df1 and B=df2.

From what I read on the documentation, the how argument for pd.merge supports the following options:

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
        left: use only keys from left frame (SQL: left outer join)
        right: use only keys from right frame (SQL: right outer join)
        outer: use union of keys from both frames (SQL: full outer join)
        inner: use intersection of keys from both frames (SQL: inner join)

but none of them gives us directly the set operations 1 and 2 above.

For reference, below is the corresponding reference for SQL (from this thread):

enter image description here

Community
  • 1
  • 1
Josh
  • 11,979
  • 17
  • 60
  • 96
  • 1
    I've rolled this back to the previous version, which I think is a great question. Please ask the specific aspect separately, happy to answer it! :) – Andy Hayden Feb 28 '14 at 23:24
  • 2
    Thanks @AndyHayden I wholeheartedly agree with your edits. For anyone interested, here is a related question: [Set differences on columns between dataframes](http://stackoverflow.com/questions/22109361/set-differences-on-columns-between-dataframes) – Josh Mar 01 '14 at 02:31

1 Answers1

4

Although these aren't supported directly, they can be achieved by tweaking with the indexes before attempting the join...

You can do set minus using the - operator:

In [11]: ind = pd.Index([1, 2, 3])

In [12]: ind2 = pd.Index([3, 4, 5])

In [13]: ind - ind2
Out[13]: Int64Index([1, 2], dtype='int64')

and set union with the | and intersection with &:

In [14]: ind | ind2
Out[14]: Int64Index([1, 2, 3, 4, 5], dtype='int64')

In [15]: ind & ind2
Out[15]: Int64Index([3], dtype='int64')

So if you have some DataFrames with these indexes, you can reindex before you join:

In [21]: df = pd.DataFrame(np.random.randn(3), ind, ['a'])  # ind = df.index

In [22]: df2 = pd.DataFrame(np.random.randn(3), ind2, ['b'])  # ind2 = df2.index

In [23]: df.reindex(ind & ind2)
Out[23]:
          a
3  1.368518

So now you can build up whatever join you want:

In [24]: df.reindex(ind & ind2).join(df2.reindex(ind & ind2))  # equivalent to inner
Out[24]:
          a         b
3  1.368518 -1.335534

In [25]: df.reindex(ind - ind2).join(df2.reindex(ind - ind2))  # join on A set minus B
Out[25]:
          a   b
1  1.193652 NaN
2  0.064467 NaN
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 1
    How would you extend this to doing the algebra on columns and not on indices? (i.e. I mean doing the set difference on a particular column, as in the diagram in the OP). I thought I could simply do `df.set_index(columns_for_the_join)` before I join, but some of the entries in the columns are `NaN`, and when I convert this column to indices, it fails, since I can't later index my Dataframe with `ind_diff = ind - ind2` if `ind` (and therefore `ind_diff`) have NaN values. – Josh Feb 28 '14 at 20:32
  • @Josh columns are also an `Index`, so you can do the same set operations with `df.columns`. Not sure I understand what you're asking, you should be able to join on indexes with NaNs... Perhaps ask this as a new question? – Andy Hayden Feb 28 '14 at 21:21
  • Thanks @Andy I greatly appreciate your help. I have updated the OP with an example to clarify what I mean. – Josh Feb 28 '14 at 23:14
  • 1
    @Josh I think that should be a new question. The original question was a good one you shouldn't change it! – Andy Hayden Feb 28 '14 at 23:22