I have two Python Pandas Dataframe like below:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K0', 'K0', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6'],
'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print left
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K1 K0
4 A4 B4 K1 K0
5 A5 B5 K1 K0
6 A6 B6 K2 K1
print right
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0
I want to merge this two Dataframe on certain condition:
- I want to merge this dataframes on key1 and key2.
- I want output Dataframe with only rows available in the left dataframe. (If the key1 and key2 of left Dataframe is available in the key1 and key2 of the right Dataframe then fulfill the current row by appending column C and D)
I tried like below:
result = pd.merge(left, right, on=['key1', 'key2'],how = 'left')
print result
# Output - which is not expected for me
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K1 K0 C1 D1
5 A3 B3 K1 K0 C2 D2
6 A4 B4 K1 K0 C1 D1
7 A4 B4 K1 K0 C2 D2
8 A5 B5 K1 K0 C1 D1
9 A5 B5 K1 K0 C2 D2
10 A6 B6 K2 K1 NaN NaN
If I drop the duplicate entries from the left Dataframe then also I didn't get the expected output.
Then, I tried to remove duplicate entries from the result Dataframe then I got below Dataframe:
result.drop_duplicates(subset = ['A','B','key1','key2'], inplace=True)
print result
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
4 A3 B3 K1 K0 C1 D1
6 A4 B4 K1 K0 C1 D1
8 A5 B5 K1 K0 C1 D1
10 A6 B6 K2 K1 NaN NaN
The problem is column C and D contains same values which is because of default keep=First behavior. I want to change this behavior so that I can get below expected Dataframe.
Edit :
If the key1 and key2 of left Dataframe match in the key1 and key2 of right Dataframe then append C and D columns to that row and If column C and D are already appended to the same row then keep C and D as Nan.
Expected Output:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A3 B3 K1 K0 C2 D2
4 A4 B4 K1 K0 NaN Nan
5 A5 B5 K1 K0 NaN Nan
6 A6 B6 K2 K1 NaN NaN