
Note: This question is inspired by the ideas discussed in this other post: DataFrame algebra in Pandas

Say I have two dataframes A and B and that for some column col_name, their values are:

A[col_name]   |  B[col_name]  
--------------| ------------
1             |  3
2             |  4
3             |  5
4             |  6

I want to compute the set difference between A and B based on col_name. The result of this operation should be:

The rows of A where A[col_name] didn't match any entries in B[col_name].

Below is the result for the above example (showing other columns of A as well):

A[col_name] | A[other_column_1] | A[other_column_2] | ...
------------|-------------------|-------------------|----
1           | 'foo'             | 'xyz'             | ...
2           | 'bar'             | 'abc'             | ...
Keep in mind that some entries in A[col_name] and B[col_name] could hold the value np.NaN. I would like to treat those entries as undefined BUT different, i.e. the set difference should return them.

How can I do this in Pandas? (generalizing to a difference on multiple columns would be great as well)

Josh
  • When you say "undefined BUT different", do you mean keep the NaNs from A even if NaN is also in B? – Andy Hayden Mar 01 '14 at 02:44
  • Also thanks for answering this as a separate question :) – Andy Hayden Mar 01 '14 at 02:45
  • Thanks @AndyHayden - I mean that, if we do `A-B` we should only get the NaNs in `A`, and not the NaNs in `B`. Reasoning: anything in `B` that is undefined can be thought of as unknown to (not present in) `A`, so it should not make it into `A-B`. Although this could depend on the actual problem and what `NaN` represents. – Josh Mar 01 '14 at 03:07

2 Answers


One way is to use the Series isin method:

In [11]: df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns = ['A', 'B'])

In [12]: df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns = ['A', 'C'])

Now you can check whether each item in df1['A'] is in df2['A']:

In [13]: df1['A'].isin(df2['A'])
Out[13]:
0    False
1    False
2     True
3     True
Name: A, dtype: bool

In [14]: df1[~df1['A'].isin(df2['A'])]  # not in df2['A']
Out[14]:
   A    B
0  1  foo
1  2  bar
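
If you need this in more than one place, the same isin trick could be wrapped in a small helper; a rough sketch (the name set_diff_on_column is just illustrative):

import pandas as pd

def set_diff_on_column(A, B, col_name):
    # rows of A whose value in col_name does not appear anywhere in B[col_name]
    return A[~A[col_name].isin(B[col_name])]

df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns=['A', 'C'])
set_diff_on_column(df1, df2, 'A')  # the rows with A == 1 and A == 2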

I think this does what you want for NaNs too:

In [21]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [3, 'meh'], [np.nan, 'baz']], columns = ['A', 'B'])

In [22]: df2 = pd.DataFrame([[3], [np.nan]], columns = ['A'])

In [23]: df1[~df1['A'].isin(df2['A'])]
Out[23]:
    A     B
0 1.0   foo
1 NaN   bar
3 NaN   baz

Note: For large frames it may be worth making these columns an index (to perform the join as discussed in the other question).
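
A rough sketch of that index-based variant (assuming a pandas version that has Index.difference; it ignores the NaN caveat from the question):

import pandas as pd

df1 = pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'meh'], [4, 'baz']], columns=['A', 'B'])
df2 = pd.DataFrame([[3, 'a'], [4, 'b']], columns=['A', 'C'])

left = df1.set_index('A')
keep = left.index.difference(df2.set_index('A').index)  # key values that appear only in df1
result = left.loc[keep].reset_index()                    # the rows with A == 1 and A == 2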

More generally

One way to merge on two or more columns is to use a dummy column:

In [31]: df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns = ['A', 'B'])

In [32]: df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns = ['A', 'B'])

In [33]: cols = ['A', 'B']

In [34]: df2['dummy'] = df2[cols].isnull().any(1)  # rows with NaNs in cols will be True

In [35]: merged = df1.merge(df2[cols + ['dummy']], how='left')

In [36]: merged
Out[36]:
    A      B  dummy
0   1    foo    NaN
1 NaN    bar   True
2   4    meh  False
3 NaN  eurgh    NaN

Rows that got a boolean in the dummy column were present in df2; True means that df2 row has a NaN in one of the merging columns. Following your spec, we should drop only those which are False:

In [37]: merged.loc[merged.dummy != False, df1.columns]
Out[37]:
    A      B
0   1    foo
1 NaN    bar
3 NaN  eurgh

Inelegant.
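
For reuse, the dummy-column steps above could be wrapped in a function. A sketch of the same idea (the name multi_col_difference is illustrative, and it assumes B has no duplicate rows in the merge columns):

import numpy as np
import pandas as pd

def multi_col_difference(A, B, cols):
    right = B[cols].copy()
    right['dummy'] = right[cols].isnull().any(axis=1)  # True where the B row has a NaN key
    merged = A.merge(right, on=cols, how='left')       # note: merged gets a fresh RangeIndex
    # keep rows that found no match (dummy is NaN) or only matched a NaN-keyed B row (dummy is True)
    return merged.loc[merged['dummy'] != False, A.columns]

df1 = pd.DataFrame([[1, 'foo'], [np.nan, 'bar'], [4, 'meh'], [np.nan, 'eurgh']], columns=['A', 'B'])
df2 = pd.DataFrame([[np.nan, 'bar'], [4, 'meh']], columns=['A', 'B'])
multi_col_difference(df1, df2, ['A', 'B'])  # keeps [1, 'foo'], [NaN, 'bar'] and [NaN, 'eurgh']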

Andy Hayden
  • Fantastic! Since your solution depends on the use of `Series`, what would happen if we are doing the difference (i.e. join) on **multiple columns**? – Josh Mar 01 '14 at 03:09
  • Ah, that's trickier, will have to think about this. Ah wait, may be ok with DataFrame isin method (which cares about alignment). – Andy Hayden Mar 01 '14 at 03:12
  • @Josh ok, have a solution. – Andy Hayden Mar 01 '14 at 06:31
  • `AttributeError: 'DataFrame' object has no attribute 'isnull'` – zhangxaochen Mar 01 '14 at 06:44
  • @zhangxaochen make sure you're using an up to date version of pandas (not sure when that method was introduced), the latest stable is 0.13.1 :). Perhaps can use `pd.isnull(df2[cols]).any(1)` in older versions? – Andy Hayden Mar 01 '14 at 06:46
  • @AndyHayden yeah I've upgraded my pandas just now ;) – zhangxaochen Mar 01 '14 at 08:11
  • Thank you @AndyHayden - Sorry for the delay on this. I have provided my own answer to the question, but I am not sure it works. I will review it soon. – Josh Mar 03 '14 at 19:50

Here is one option that is also not elegant since it pre-maps the NaN values to some other value (0) so that they can be used as an index:

def left_difference(L, R, L_on, R_on, NULL_VALUE):
  L[L_on] = L[L_on].fillna(NULL_VALUE)
  L.set_index(L_on, inplace=True)

  R[R_on] = R[R_on].fillna(NULL_VALUE)
  R.set_index(R_on, inplace=True)

  # MultiIndex difference:
  diff = L.ix[L.index - R.index]
  diff = diff.reset_index()

  return diff

To make this work properly, NULL_VALUE should be a value that does not occur in L[L_on] or R[R_on].

Josh
  • Keep in mind this is also destructive / has side-effects: changes index of L and R and replaces NaNs with NULL_VALUE on those cols. Also diff may have different index – Andy Hayden Mar 03 '14 at 19:54
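
A sketch of the same approach that works on copies, so L and R are left untouched, and that uses .loc and Index.difference rather than .ix and Index subtraction (NULL_VALUE must still be a value that never occurs in the key columns, a NaN key that appears on both sides is still dropped, and the result still comes back with a new index):

import pandas as pd

def left_difference(L, R, L_on, R_on, NULL_VALUE):
    left = L.copy()
    right = R.copy()

    # map NaN keys to a sentinel so they can live in the index
    left[L_on] = left[L_on].fillna(NULL_VALUE)
    right[R_on] = right[R_on].fillna(NULL_VALUE)

    left = left.set_index(L_on)
    right = right.set_index(R_on)

    keep = left.index.difference(right.index)  # keys present only in L (sorted, deduplicated)
    return left.loc[keep].reset_index()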