3

I have two pandas data-frame and each of them are of different sizes each over 1 million records. I am looking to compare these two data-frames and identify the differences.

DataFrameA

ID   Name    Age  Sex
1A1  Cling   21    M
1B2  Roger   22    M
1C3  Stew    23    M

DataFrameB

ID   FullName   Gender  Age
1B2  Roger       M       21
1C3  Rick        M       23
1D4  Ash         F       21

DataFrameB will always have more records than DataFrameA but the records found in DataFrameA may not still be in DataFrameB. The column names in the DataFrameA and DataFrameB are different. I have the mapping stored in a different dataframe.

MappingDataFrame

DataFrameACol   DataFrameBCol
ID               ID
Name             FullName
Age              Age
Sex              Gender

I am looking to compare these two and add a column next to it with the result.

Col Name Adder for DataFrame A = "_A_Txt"

Col Name Adder for DataFrame B = "_B_Txt"

ExpectedOutput

ID   Name_A_Txt FullName_B_Text   Result_Name   Age_A_Txt  Age_B_Txt   Result_Age     
1B2  Roger           Roger          Match        ...        ...
1C3  Stew            Rick           No Match     ...        ...

The column names should have the text added before this.

I am using a For loop at the moment to build this logic. But 1 million record is taking ages to complete. I left the program running for more than 50 minutes and it wasn't completed as in real-time, I am building it for more than 100 columns.

I will open bounty for this question and award the bounty, even if the question was answered before opening it as a reward. As, I have been struggling really for performance using For loop iteration.

To start with DataFrameA and DataFrameB, use the below code,

import pandas as pd

d = {
     'ID':['1A1', '1B2', '1C3'], 
     'Name':['Cling', 'Roger', 'Stew'],
     'Age':[21, 22, 23], 
     'Sex':['M', 'M', 'M']
     }

DataFrameA = pd.DataFrame(d)

d = {
     'ID':['1B2', '1C3', '1D4'], 
     'FullName':['Roger', 'Rick', 'Ash'],
     'Gender':['M', 'M', 'F'], 
     'Age':[21, 23, 21]
     }

DataFrameB = pd.DataFrame(d)

I believe, this question is a bit different from the suggestion (definition on joins) provided by Coldspeed as this also involves looking up at different column names and adding a new result column along. Also, the column names need to be transformed on the result side.

The OutputDataFrame Looks as below,

For better understanding of the readers, I am putting the column names in the Row in order

Col 1 -  ID (Coming from DataFrameA)
Col 2 -  Name_X (Coming from DataFrameA)
Col 3 -  FullName_Y (Coming from DataFrameB)
Col 4 -  Result_Name (Name is what is there in DataFrameA and this is a comparison between Name_X and FullName_Y)
Col 5 -  Age_X (Coming from DataFrameA)
Col 6 -  Age_Y (Coming From DataFrameB)
Col 7 -  Result_Age (Age is what is there in DataFrameA and this is a result between Age_X and Age_Y)
Col 8 -  Sex_X (Coming from DataFrameA)
Col 9 -  Gender_Y (Coming from DataFrameB)
Col 10 - Result_Sex (Sex is what is there in DataFrameA and this is a result between Sex_X and Gender_Y)
cs95
  • 379,657
  • 97
  • 704
  • 746
  • This looks like a merge problem, no? `df1.merge(df2, how='inner', on=['ID'], suffixes=('_A', '_B'))` – cs95 Jan 17 '19 at 21:26
  • You can read up on INNER JOINs at [this link](https://stackoverflow.com/questions/53645882/pandas-merging-101). If you're having any trouble getting it to work, let us know. – cs95 Jan 17 '19 at 21:27
  • Use your mapping DataFrame to map the column names before you merge, otherwise the suffixes won't be as helpful. – ALollz Jan 17 '19 at 21:28
  • Yes inner join on ID will work. But the column names are different right? But even while assigning the column names and identifying the match or no-match , I am finding it difficult without the for loop & for loop is too slow..... – Student of the Digital World Jan 17 '19 at 21:29
  • @ALollz that's what I am trying to do but failing. – Student of the Digital World Jan 17 '19 at 21:53
  • @coldspeed Kindly don't mark this as duplicate. I am looking for a solution, this might be a basic one for experts but it is too difficult to learn quickly from the link that you have mentioned, where it doesn't have mapping different column name concepts for compare and transforming column names on the outputs. – Student of the Digital World Jan 17 '19 at 21:55
  • I extrapolated the core of the link into my first comment, do you take a look? I told you to use `df1.merge(df2, how='inner', on=['ID'], suffixes=('_A', '_B'))`. From there, all you need to do is just change the column names. Doing post processing on the columns does not change the core question here, the answer to which is merge. – cs95 Jan 17 '19 at 21:57
  • @coldspeed I did. But unfortunately, couldn't still understand how I will be able to compare Name and FullName. They are completely different name, how can I do that dynamically... Poor me. – Student of the Digital World Jan 17 '19 at 22:01
  • 1
    @Sid29 I was not clear on exactly what you would have wanted, so I just threw out the answer. Is there anything missing from it? – cs95 Jan 17 '19 at 22:08
  • @coldspeed Yes, but looking at the answer. I wanted to have a result_Name column, result_Age column, result_sex Column. I don't prefer adding another line to find result_Age. This is going to change dynamic and I will have more than 100 columns to compare. So prefer a function. I also don't want to go back and re-order the output dataframe as it will again affect performance. The usual order, the ID column, the first column in DataFrameA, the relevant column in DataFrameB, the Result_Column then the next column in DataFrame A..followed by the relevant column...then the result column – Student of the Digital World Jan 17 '19 at 22:10
  • @coldspeed I updated the question with the expected output format of the dataframe. Can you please take a look and see if the answer can be updated. Thank you. – Student of the Digital World Jan 17 '19 at 22:34

1 Answers1

2
m = list(mapping_df.set_index('DataFrameACol')['DataFrameBCol']
                   .drop('ID')
                   .iteritems())
m[m.index(('Age', 'Age'))] = ('Age_x', 'Age_y')
m 
# [('Name', 'FullName'), ('Age_x', 'Age_y'), ('Sex', 'Gender')]

Start with an inner merge:

df3 = (df1.merge(df2, how='inner', on=['ID'])
          .reindex(columns=['ID', *(v for V in m for v in V)]))

df3
    ID   Name FullName  Age_x  Age_y Sex Gender
0  1B2  Roger    Roger     22     21   M      M
1  1C3   Stew     Rick     23     23   M      M

Now, compare the columns and set values with np.where:

l, r = map(list, zip(*m))
matches = (df3[l].eq(df3[r].rename(dict(zip(r, l)), axis=1))
                 .add_prefix('Result_')
                 .replace({True: 'Match', False: 'No Match'}))

for k, v in m:
    name = f'Result_{k}'
    df3.insert(df3.columns.get_loc(v)+1, name, matches[name])

df3.columns
# Index(['ID', 'Name', 'FullName', 'Result_Name', 'Age_x', 'Age_y',
#        'Result_Age_x', 'Sex', 'Gender', 'Result_Sex'],
#       dtype='object')

df3.filter(like='Result_')

  Result_Name Result_Age_x Result_Sex
0       Match     No Match      Match
1    No Match        Match      Match
cs95
  • 379,657
  • 97
  • 704
  • 746