Intersection of two pandas dataframes based on column entries

Question

Suppose I have two DataFrames like so:

>>dfA
        S        T            prob
0   ! ! !    ! ! !   8.1623999e-05
1   ! ! !  ! ! ! "   0.00354090007
2   ! ! !  ! ! ! .   0.00210241997
3   ! ! !  ! ! ! ?  6.55684998e-05
4   ! ! !  ! ! !       0.203119993
5   ! ! !  ! ! ! ”  6.62070015e-05
6   ! ! !      ! !   0.00481862016
7   ! ! !        !    0.0274260994
8   ! ! !  " ! ! !  7.99940026e-05
9   ! ! !      " !  1.51188997e-05
10  ! ! !        "  8.50678989e-05

>>dfB
        S        T                              knstats
0   ! ! !  ! ! ! !              knstats=2,391,104,64,25
1   ! ! !  ! ! ! "                 knstats=4,391,6,64,2
2   ! ! !  ! ! ! .                 knstats=4,391,5,64,2
3   ! ! !  ! ! ! ?                 knstats=1,391,4,64,4
4   ! ! !    ! ! !            knstats=220,391,303,64,55
5   ! ! !      ! !            knstats=16,391,957,64,115
6   ! ! !        !           knstats=28,391,5659,64,932
7   ! ! !  " ! ! !                 knstats=2,391,2,64,1
8   ! ! !      " !               knstats=1,391,37,64,13
9   ! ! !        "  knstats=2,391,1.11721e+06,64,180642
10  ! !        . "        knstats=2,391,120527,64,20368

I want to create a new DataFrame which is composed of the rows which have matching "S" and "T" entries in both matrices, along with the prob column from dfA and the knstats column from dfB. The result should look something like the following, and it is important that the order is the same:

       S        T            prob                              knstats
0  ! ! !  ! ! ! !   8.1623999e-05              knstats=2,391,104,64,25
1  ! ! !  ! ! ! "   0.00354090007                 knstats=4,391,6,64,2
2  ! ! !  ! ! ! .   0.00210241997                 knstats=4,391,5,64,2
3  ! ! !  ! ! ! ?  6.55684998e-05                 knstats=1,391,4,64,4
4  ! ! !    ! ! !     0.203119993            knstats=220,391,303,64,55
5  ! ! !      ! !   0.00481862016            knstats=16,391,957,64,115
6  ! ! !        !    0.0274260994           knstats=28,391,5659,64,932
7  ! ! !  " ! ! !  7.99940026e-05                 knstats=2,391,2,64,1
8  ! ! !      " !  1.51188997e-05               knstats=1,391,37,64,13
9  ! ! !        "  8.50678989e-05  knstats=2,391,1.11721e+06,64,180642

user308827 · Accepted Answer · 2014-11-14T02:42:48.390

46

You can merge them so:

s1 = pd.merge(dfA, dfB, how='inner', on=['S', 'T'])

To drop NA rows:

s1.dropna(inplace=True)

edited Nov 14 '14 at 02:42

answered Nov 14 '14 at 02:31

user308827

21,227
87
254
417

2

I am not interested in simply merging them, but taking the intersection. That is, if there is a row where 'S' and 'T' do not have both prob and knstats, I want to get rid of that row. You'll notice that dfA and dfB do not match up exactly. However, this seems like a good first step. How can I prune the rows with NaN values in either prob or knstats in the output matrix? – Bib Nov 14 '14 at 02:38
Changed to how='inner', that will compute the intersection based on 'S' an 'T' – user308827 Nov 14 '14 at 02:44
Also, you can use dropna to drop rows with any NaN's. Edited my answer – user308827 Nov 14 '14 at 02:45
7

This is a join, not an intersection – lfk Nov 09 '17 at 04:15
1

by definition: an intersection == an equality join on all columns – user__42 Jun 04 '18 at 21:59

score 1 · Answer 2 · answered Apr 07 '23 at 19:58

To select the rows from the two dataframes where S and T values intersect, one could use Index.intersection(). The idea is to find the common rows (i.e. intersection) in S and T columns in the two dataframes and select only those rows.

# convert S and T columns into MultiIndex
dfA_idx = pd.MultiIndex.from_frame(dfA[['S', 'T']])
dfB_idx = pd.MultiIndex.from_frame(dfB[['S', 'T']])
# get intersecting rows in S and T
common_idx = dfA_idx.intersection(dfB_idx)
# filter rows where S and T intersect
dfA_common = dfA.set_index(['S', 'T']).reindex(common_idx).reset_index()
dfB_common = dfB.set_index(['S', 'T']).reindex(common_idx).reset_index()

The concatenation of these frames will produce the same result as merge(), i.e.

x = dfA_common.join(dfB_common[['knstats']])
y = dfA.merge(dfB, on=['S', 'T'])
x.equals(y)                                    # True

Intersection of two pandas dataframes based on column entries

2 Answers2

Linked