2

I am trying to select rows in a pandas data frame based on it's values matching those of another data frame. Crucially, I only want to match values in rows, not throughout the whole series. For example:

df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})

df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})

I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:

df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]

This of course returns all rows, as the all values are present in df2 at some point, but not necessarily the same row. How can I limit this so the values tested for 'b' are only those rows where the value 'a' was found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.

Note that data frames may be of different shapes, and contain multiple matching rows.

ssast
  • 779
  • 1
  • 8
  • 17

2 Answers2

4

Similar to this post, here's one using broadcasting -

df1[(df1.values == df2.values[:,None]).all(-1).any(0)]

The idea is :

1) Use np.all for the both part in ""both 'a' and 'b' values"".

2) Use np.any for the any part in "from df1 match any row in df2".

3) Use broadcasting for doing all these in a vectorized fashion by extending dimensions with None/np.newaxis.

Sample run -

In [41]: df1
Out[41]: 
   a  b
0  1  4
1  2  5
2  3  6

In [42]: df2  # Modified to add another row : [1,4] for variety
Out[42]: 
   a  b
0  3  4
1  2  5
2  1  6
3  1  4

In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]: 
   a  b
0  1  4
1  2  5
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • This works, and I never would have worked this out myself, thank you. I'm surprised there's no existing pandas function to perform this operation. – ssast Nov 16 '16 at 18:49
  • @ssast Well even NumPy doesn't have a built-in for this. There's a very relevant [`Q&A`](http://stackoverflow.com/questions/38674027/find-the-row-indexes-of-several-values-in-a-numpy-array) that researches other more efficient solutions to find row indices. Do have a good look there! – Divakar Nov 16 '16 at 18:53
  • What is the `None` for? – FaCoffee Apr 30 '18 at 12:42
  • @FaCoffee Info - https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html#numpy.newaxis – Divakar Apr 30 '18 at 12:51
0

use numpy broadcasting

pd.DataFrame((df1.values[:, None] == df2.values).all(2),
             pd.Index(df1.index, name='df1'),
             pd.Index(df2.index, name='df2'))

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624