How to merge two pandas dataframe with left join when both dataframe contains duplicate keys?

Question

I have two Python Pandas Dataframe like below:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K0', 'K0', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
                      'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
print left
    A   B key1 key2
0  A0  B0   K0   K0
1  A1  B1   K0   K1
2  A2  B2   K1   K0
3  A3  B3   K1   K0
4  A4  B4   K1   K0
5  A5  B5   K1   K0
6  A6  B6   K2   K1

print right
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K1   K0
2  C2  D2   K1   K0
3  C3  D3   K2   K0

I want to merge this two Dataframe on certain condition:

I want to merge this dataframes on key1 and key2.
I want output Dataframe with only rows available in the left dataframe. (If the key1 and key2 of left Dataframe is available in the key1 and key2 of the right Dataframe then fulfill the current row by appending column C and D)

I tried like below:

result = pd.merge(left, right, on=['key1', 'key2'],how = 'left')
print result

# Output - which is not expected for me
     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A2  B2   K1   K0   C2   D2
4   A3  B3   K1   K0   C1   D1
5   A3  B3   K1   K0   C2   D2
6   A4  B4   K1   K0   C1   D1
7   A4  B4   K1   K0   C2   D2
8   A5  B5   K1   K0   C1   D1
9   A5  B5   K1   K0   C2   D2
10  A6  B6   K2   K1  NaN  NaN

If I drop the duplicate entries from the left Dataframe then also I didn't get the expected output.

Then, I tried to remove duplicate entries from the result Dataframe then I got below Dataframe:

result.drop_duplicates(subset = ['A','B','key1','key2'], inplace=True)
print result

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
4   A3  B3   K1   K0   C1   D1
6   A4  B4   K1   K0   C1   D1
8   A5  B5   K1   K0   C1   D1
10  A6  B6   K2   K1  NaN  NaN

The problem is column C and D contains same values which is because of default keep=First behavior. I want to change this behavior so that I can get below expected Dataframe.

Edit :

If the key1 and key2 of left Dataframe match in the key1 and key2 of right Dataframe then append C and D columns to that row and If column C and D are already appended to the same row then keep C and D as Nan.

Expected Output:

     A   B key1 key2    C    D
0   A0  B0   K0   K0   C0   D0
1   A1  B1   K0   K1  NaN  NaN
2   A2  B2   K1   K0   C1   D1
3   A3  B3   K1   K0   C2   D2
4   A4  B4   K1   K0   NaN  Nan
5   A5  B5   K1   K0   NaN  Nan
6   A6  B6   K2   K1  NaN  NaN

`df1` has four rows with `K1` and `K0`. `df2` has two such rows. Your expected results consist of four such rows that match `df1` where we alternate through available rows from `df1`. What happens if there were 3 such rows in `df1` and two such rows in `df2` — piRSquared, Oct 27 '17 at 09:12
@piRSquared- Actually column C and D is important for me so I want to take all entries from column C and D which are unique for particular match. In that case (3 rows in df1 case), first 2 rows should take C1 and D1 and third row take C2 and D2. — ketan, Oct 27 '17 at 09:16

score 2 · Answer 1 · answered Jun 24 '18 at 18:14

Unfortunately, your question does not describe what you want to achieve in a way that it could become useful to anyone with a similar problem.

Indeed, you wanted to obtain a sorted merge for repeated merging keys.

The logical way to proceed is add a sequence number to make the merging multiple key unique. Then what follows is a trivial merge.

left['Order'] = left.groupby(['key1','key2']).cumcount()
right['Order'] = right.groupby(['key1','key2']).cumcount()

result = left.merge(right, how='left', 
                    on=['key1','key2','Order']).drop('Order',axis=1)

This is the exactly what I want. __groupby__ and **cumcount** is the key! — agfe2, Jan 03 '23 at 06:17

How to merge two pandas dataframe with left join when both dataframe contains duplicate keys?

1 Answers1