How to find the intersection of a pair of columns in multiple pandas dataframes with pairs in any order?

Question

I have multiple pandas dataframes, to keep it simple, let's say I have three.

   >> df1=
       col1  col2
   id1  A     B  
   id2  C     D  
   id3  B     A  
   id4  E     F  


    >> df2=
       col1  col2
   id1  B     A  
   id2  D     C  
   id3  M     N  
   id4  F     E  

    >> df3=
       col1  col2
   id1  A     B  
   id2  D     C  
   id3  N     M  
   id4  E     F

The result needed is :

    >> df=
       col1  col2
   id1  A     B
   id2  C     D
   id3  E     F

Because the pairs (A, B),(C, D),(E, F) appear in all the data frames although it may be reversed.

While using pandas merge it just considers the way columns are passed. To check my observation I tried the following code for two data frames:

df1['reverse_1'] = (df1.col1+df1.col2).isin(df2.col1 + df2.col2)

df1['reverse_2'] = (df1.col1+df1.col2).isin(df2.col2 + df2.col1)

And I found that the results differ:

col1    col2    reverse_1   reverse_2
 a        b       False      True
 c        d       False      True
 b        a       True       False
 e        f       False      True

So, if I collect 'True' values from both reverse_1 and reverse_2 columns, I can get the intersect of both the data frames. Even if I do it for two data frames it's not clear to me how to proceed with more data frames (more than two). I am little confused about that. Any suggestions?

There are 4 columns but as I needed to compare the two columns and copy the rest of the data from other columns. — Ashutosh, Nov 28 '18 at 14:23
Please look at the three data frames [df1,df2,df3]. You will see that the pair (A, B) appears in all of them. But it's (B, A) in df2. Same is the case with pairs (C, D) and (E, F). So I need to find the common pairs of elements in all the data frames where elements can occur in any order, (A, B) or (B, A) — Ashutosh, Nov 28 '18 at 14:29

jezrael · Accepted Answer · 2018-11-28T14:43:08.897

5

You can create list of DataFrames and in list comprehension sorting per rows with removing duplicates:

dfs = [df1,df2,df3]

L = [pd.DataFrame(np.sort(x.values, axis=1), columns=x.columns).drop_duplicates() 
     for x in dfs]
print (L)
[  col1 col2
0    A    B
1    C    D
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F,   col1 col2
0    A    B
1    C    D
2    M    N
3    E    F]

And then merge list of DataFrames by all columns (no parameter on):

from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right), L)
print (df)
  col1 col2
0    A    B
1    C    D
2    E    F

Another solution for @pygo:

Create index by frozensets and join together by concat with inner join, last remove duplicates by index by duplicated with boolean indexing and iloc for get first 2 columns:

df = pd.concat([x.set_index(x.apply(frozenset, axis=1)) for x in dfs], axis=1, join='inner')
df = df.iloc[~df.index.duplicated(), :2]
print (df)
       col1 col2
(B, A)    A    B
(C, D)    C    D
(F, E)    E    F

edited Nov 28 '18 at 14:43

answered Nov 28 '18 at 14:16

jezrael

822,522
95
1,334
1,252

@pygo This will simply append all the columns side by side. While if axis=0 then it will stack the column elements. But this doesn't do what is intended. I am working with the answer given by "jezrael " – Ashutosh Nov 28 '18 at 14:26
Okay, hope you will get solution from @jezrael's answer – Karn Kumar Nov 28 '18 at 14:27
1

@jezrael Elegant is the only word to this solution. By the way, I am inspired by your activeness on this forum and depth of knowledge as well. Can you add a little explanation on the first part of the code? – Ashutosh Nov 28 '18 at 14:36
@pygo - I create solution for you with `frozenset`s ;) – jezrael Nov 28 '18 at 14:38
@jezrael , you made my day :-) thnx for alternate solutions ! another learning +1 , added thes solutions to my note. – Karn Kumar Nov 28 '18 at 14:38
1

@Ashutosh - sure, you can sorting each row of DataFrame by `np.sort` and assign to DataFrame from numpy array for possible call function `DataFrame.drop_duplicates()`.this solution is call in list comprehension for each DataFrame of list of DataFrames. – jezrael Nov 28 '18 at 14:40

score 1 · Answer 2 · answered Nov 28 '18 at 16:49

Somewhat similar to some of the earlier answers.

import pandas as pd
from io import StringIO 

# Test data
df1 = pd.read_table(StringIO ("""
id col1 col2
id1  A     B
id2  C     D
id3  B     A
id4  E     F
"""), delim_whitespace = True)
df2 = pd.read_table(StringIO ("""
id col1 col2
id1  B     A  
id2  D     C  
id3  M     N  
id4  F     E  
"""), delim_whitespace = True)
df3 = pd.read_table(StringIO("""
id col1 col2
id1  A     B  
id2  D     C  
id3  N     M  
id4  E     F 
"""), delim_whitespace = True)

# List of n dataframes
dfs = [df1, df2, df3]

# Use frozenset to define the column values without regard for order 
# pandas apply iterates over each row
# list expression iterates over each dataframe
combined_columns = [pd.Series(df.apply(lambda r: frozenset((r.col1, r.col2)), axis=1), name = 'combined') for df in dfs]
print(combined_columns)
# Results in  alist of Series named 'combined'
#[0    (B, A)
# 1    (D, C)
# 2    (B, A)
# 3    (F, E)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (N, M)
# 3    (E, F)
# Name: combined, dtype: object, 
# 0    (B, A)
# 1    (D, C)
# 2    (M, N)
# 3    (F, E)
# Name: combined, dtype: object]

dfs_combined = [pd.concat([dfs[i], combined_columns[i]], axis = 1) for i in range(len(dfs))]
print(dfs_combined)
# Result in a list of dataframes with the extra columns
#[    id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    C    D   (D, C)
# 2  id3    B    A   (B, A)
# 3  id4    E    F   (F, E),     
#     id col1 col2 combined
# 0  id1    B    A   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    M    N   (N, M)
# 3  id4    F    E   (E, F),
#     id col1 col2 combined
# 0  id1    A    B   (B, A)
# 1  id2    D    C   (D, C)
# 2  id3    N    M   (M, N)
# 3  id4    E    F   (F, E)]

# The reduce function operates on pairs, with previous result as the first argument 
from functools import reduce
result = reduce(lambda df1, df2: df1[df1['combined'].isin(df2['combined'])], dfs_combined).drop_duplicates(subset='combined')
print(result)
#    id col1 col2 combined
#0  id1    A    B   (B, A)
#1  id2    C    D   (D, C)
#3  id4    E    F   (F, E)

How to find the intersection of a pair of columns in multiple pandas dataframes with pairs in any order?

2 Answers2

Linked