I have two pandas DataFrames of different sizes, each with over 1 million records. I want to compare these two DataFrames and identify the differences.
DataFrameA
ID Name Age Sex
1A1 Cling 21 M
1B2 Roger 22 M
1C3 Stew 23 M
DataFrameB
ID FullName Gender Age
1B2 Roger M 21
1C3 Rick M 23
1D4 Ash F 21
DataFrameB will always have more records than DataFrameA, but a record found in DataFrameA may not be present in DataFrameB. The column names in DataFrameA and DataFrameB are different; I have the mapping stored in a separate DataFrame.
MappingDataFrame
DataFrameACol DataFrameBCol
ID ID
Name FullName
Age Age
Sex Gender
I want to compare each mapped pair of columns and add a column next to the pair with the result.
Col Name Adder for DataFrame A = "_A_Txt"
Col Name Adder for DataFrame B = "_B_Txt"
ExpectedOutput
ID Name_A_Txt FullName_B_Txt Result_Name Age_A_Txt Age_B_Txt Result_Age
1B2 Roger Roger Match ... ...
1C3 Stew Rick No Match ... ...
The column names should carry the adder text defined above.
I am currently using a for loop to build this logic, but with 1 million records it takes ages to complete. I left the program running for more than 50 minutes and it had not finished; in the real data, I am building this for more than 100 columns.
I will open a bounty for this question and award it as a reward even if the question is answered before the bounty opens, as I have been really struggling with the performance of for-loop iteration.
To create DataFrameA and DataFrameB, use the code below:
import pandas as pd

d = {
    'ID': ['1A1', '1B2', '1C3'],
    'Name': ['Cling', 'Roger', 'Stew'],
    'Age': [21, 22, 23],
    'Sex': ['M', 'M', 'M']
}
DataFrameA = pd.DataFrame(d)

d = {
    'ID': ['1B2', '1C3', '1D4'],
    'FullName': ['Roger', 'Rick', 'Ash'],
    'Gender': ['M', 'M', 'F'],
    'Age': [21, 23, 21]
}
DataFrameB = pd.DataFrame(d)
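For completeness, the mapping table shown earlier can be built the same way. This is only a minimal sketch: the values are copied from the MappingDataFrame table above, and the `a_to_b` lookup dictionary is a convenience name I am introducing here, not something from my real code.

```python
import pandas as pd

# Mapping from DataFrameA column names to DataFrameB column names,
# copied row by row from the MappingDataFrame table in the question.
MappingDataFrame = pd.DataFrame({
    'DataFrameACol': ['ID', 'Name', 'Age', 'Sex'],
    'DataFrameBCol': ['ID', 'FullName', 'Age', 'Gender'],
})

# Hypothetical convenience lookup: DataFrameA column name -> DataFrameB column name.
a_to_b = dict(zip(MappingDataFrame['DataFrameACol'],
                  MappingDataFrame['DataFrameBCol']))
```

With this in place, `a_to_b['Sex']` gives the DataFrameB column that `Sex` should be compared against.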
I believe this question is a bit different from the suggestion (the definition of joins) provided by Coldspeed, as it also involves looking up different column names and adding a new result column alongside them. Also, the column names need to be transformed on the result side.
The OutputDataFrame looks as below.
For the readers' better understanding, I am listing the column names in order:
Col 1 - ID (Coming from DataFrameA)
Col 2 - Name_X (Coming from DataFrameA)
Col 3 - FullName_Y (Coming from DataFrameB)
Col 4 - Result_Name (Name is what is there in DataFrameA and this is a comparison between Name_X and FullName_Y)
Col 5 - Age_X (Coming from DataFrameA)
Col 6 - Age_Y (Coming From DataFrameB)
Col 7 - Result_Age (Age is what is there in DataFrameA and this is a result between Age_X and Age_Y)
Col 8 - Sex_X (Coming from DataFrameA)
Col 9 - Gender_Y (Coming from DataFrameB)
Col 10 - Result_Sex (Sex is what is there in DataFrameA and this is a result between Sex_X and Gender_Y)
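To make the target layout concrete, here is a rough sketch that produces the ten columns listed above on the sample data without looping over rows. It rests on several assumptions that may not hold for the real data: only IDs present in both frames are kept (an inner join), the key column is called `ID` on both sides, IDs are unique, and there are no missing values. The helper name `compare_frames` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

DataFrameA = pd.DataFrame({'ID': ['1A1', '1B2', '1C3'],
                           'Name': ['Cling', 'Roger', 'Stew'],
                           'Age': [21, 22, 23],
                           'Sex': ['M', 'M', 'M']})
DataFrameB = pd.DataFrame({'ID': ['1B2', '1C3', '1D4'],
                           'FullName': ['Roger', 'Rick', 'Ash'],
                           'Gender': ['M', 'M', 'F'],
                           'Age': [21, 23, 21]})
MappingDataFrame = pd.DataFrame({'DataFrameACol': ['ID', 'Name', 'Age', 'Sex'],
                                 'DataFrameBCol': ['ID', 'FullName', 'Age', 'Gender']})


def compare_frames(df_a, df_b, mapping, key='ID', a_suffix='_X', b_suffix='_Y'):
    """Hypothetical helper: compare mapped columns for IDs found in both frames."""
    # Column pairs to compare, skipping the key itself.
    pairs = [(a, b) for a, b in zip(mapping['DataFrameACol'],
                                    mapping['DataFrameBCol']) if a != key]
    # Suffix every non-key column so the two sides stay distinguishable.
    left = df_a.set_index(key).add_suffix(a_suffix)
    right = df_b.set_index(key).add_suffix(b_suffix)
    # Assumption: only IDs present in both frames are wanted (inner join).
    merged = left.join(right, how='inner')

    cols = {}
    for a_col, b_col in pairs:  # loops over columns, not over rows
        a_name, b_name = a_col + a_suffix, b_col + b_suffix
        cols[a_name] = merged[a_name]
        cols[b_name] = merged[b_name]
        # Whole-column comparison; each pair costs one vectorized pass.
        equal = merged[a_name].eq(merged[b_name])
        cols['Result_' + a_col] = np.where(equal, 'Match', 'No Match')
    out = pd.DataFrame(cols, index=merged.index)
    return out.rename_axis(key).reset_index()


result = compare_frames(DataFrameA, DataFrameB, MappingDataFrame)
print(result)
```

Passing `a_suffix='_A_Txt'` and `b_suffix='_B_Txt'` should give the column names from the ExpectedOutput table instead. The remaining loop runs once per column pair (about 100 iterations in my real case) rather than once per record, which is the part I am hoping will address the performance problem.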