Identify column wise difference in pandas dataframe

Question

I have two sets of pandas dataframes:

df_old:

    Drug_Name   Special_Code    Tier    bold    italic
0   abc None    T9  FALSE   TRUE
1   por None    T9  TRUE    FALSE
2   ASD None    T9  FALSE   TRUE
3   bhj None    T9  TRUE    FALSE
4   JLL None    T9  FALSE   TRUE
5   hhj None    T2  TRUE    FALSE
6   YUI None    T1  FALSE   TRUE
7   jjj None    T1  FALSE   TRUE
8   IOP None    T2  FALSE   TRUE
9   lkh None    T9  FALSE   TRUE
10  jjs None    T7  FALSE   TRUE
11  LHH R       T6  TRUE    FALSE
12  kjk None    T7  TRUE    FALSE

df_new:

    Drug_Name   Special_Code    Tier    bold    italic
0   ABC None    T9  FALSE   TRUE
1   por Z       T9  TRUE    FALSE
2   ASD None    T8  FALSE   TRUE
3   bhj None    T9  FALSE   FALSE
4   JLL None    T9  FALSE   FALSE
5   HHJ None    T1  TRUE    FALSE
6   YUI A       T1  TRUE    TRUE
7   jjj None    T1  FALSE   TRUE
8   IOP None    T2  FALSE   TRUE
9   lkh None    T9  FALSE   TRUE
10  YUH I       T3  TRUE    FALSE
11  jjk None    T8  TRUE    TRUE
12  GGJ None    T7  FALSE   TRUE
13  kkl R       T6  TRUE    FALSE
14  gvb None    T7  TRUE    FALSE

I am looking for getting the information similar to that shown in 'CHANGE_FLAG' in the sample Output below:

    df_old                          df_new                      
    Drug_Name   Special_Code    Tier    bold    italic          Drug_Name   Special_Code    Tier    bold    italic      CHANGE_FLAG
0   abc None    T9  FALSE   TRUE        0   ABC None    T9  FALSE   TRUE        Drug_Name Changed
1   por None    T9  TRUE    FALSE       1   por Z       T9  TRUE    FALSE       Special_Code Changed
2   ASD None    T9  FALSE   TRUE        2   ASD None    T8  FALSE   TRUE        Tier Changed
3   bhj None    T9  TRUE    FALSE       3   bhj None    T9  FALSE   FALSE       bold Changed
4   JLL None    T9  FALSE   TRUE        4   JLL None    T9  FALSE   FALSE       italic Changed
5   hhj None    T2  TRUE    FALSE       5   HHJ None    T1  TRUE    FALSE       Drug_Name and Tier Changed
6   YUI None    T1  FALSE   TRUE        6   YUI A       T1  TRUE    TRUE        Special_Code and bold Changed
7   jjj None    T1  FALSE   TRUE        7   jjj None    T1  FALSE   TRUE        100% Match
8   IOP None    T2  FALSE   TRUE        8   IOP None    T2  FALSE   TRUE        100% Match
9   lkh None    T9  FALSE   TRUE        9   lkh None    T9  FALSE   TRUE        100% Match
10  jjs None    T7  FALSE   TRUE                                                No Match
11  LHH R       T6  TRUE    FALSE                                               No Match
12  kjk None    T7  TRUE    FALSE                                               No Match
                                        10  YUH I       T3  TRUE    FALSE       No Match
                                        11  jjk None    T8  TRUE    TRUE        No Match
                                        12  GGJ None    T7  FALSE   TRUE        No Match
                                        13  kkl R       T6  TRUE    FALSE       No Match
                                        14  gvb None    T7  TRUE    FALSE       No Match

I am looking for getting the information in the 'CHANGE_FLAG'. The issue is that we want to identify which particular column has changed and also if there are a combination of columns that have changed, those need to be flagged separately. "One key rule is that drug name can only change in case (small case to caps or vice versa)". I thought of creating a combination of hash values for different columns and then using those to perform merge. Something like this:

# convert 'bold' and 'italic' fields to numeric
    df_old['bold'] = [1 if val else 0 for val in df_old['bold']]
    df_old['italic'] = [1 if val else 0 for val in df_old['italic']]

    df_new['bold'] = [1 if val else 0 for val in df_new['bold']]
    df_new['italic'] = [1 if val else 0 for val in df_new['italic']]

    # hash with 'Drug_Name' alone
    df_old['Hash_Drug'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name']]
    df_new['Hash_Drug'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name']]

    # hash with case-insensitive (uppercase) 'Drug_Name' alone
    df_old['Hash_Drug_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'].str.upper()]
    df_new['Hash_Drug_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'].str.upper()]

    # hash with 'Drug_Name' and 'Special_Cdoe' combined # .astype(str)
    df_old['Hash_Drug_Code'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code']]
    df_new['Hash_Drug_Code'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code']]

    # hash with 'Drug_Name' and 'Tier' combined
    df_old['Hash_Drug_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Tier']]
    df_new['Hash_Drug_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Tier']]

    # hash with 'Drug_Name', 'Special_Cdoe', 'Tier' combined
    df_old['Hash_Drug_Code_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier']]
    df_new['Hash_Drug_Code_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier']]

    # hash with all columns combined except 'italic'
    df_old['Hash_Not_Italic'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str)]
    df_new['Hash_Not_Italic'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str)]

    # hash with all columns combined except 'bold'
    df_old['Hash_Not_Bold'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['italic'].astype(str)]
    df_new['Hash_Not_Bold'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['italic'].astype(str)]

    # hash with all columns combined
    df_old['Hash_All'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str) + '_' + df_old['italic'].astype(str)]
    df_new['Hash_All'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str) + '_' + df_new['italic'].astype(str)]

    # same as 'Hash_All' but with 'Drug_Name' caps
    df_old['Hash_All_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'].str.upper() + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str) + '_' + df_old['italic'].astype(str)]
    df_new['Hash_All_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'].str.upper()  + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str) + '_' + df_new['italic'].astype(str)]

And then generate separate dataframes based on hash values:

# finding matched records with 'Drug_Name' alone match - Using 'Hash_Drug'
df_drug_match = pd.merge(df_old, df_new, on=['Hash_Drug'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding matched records with 'Drug_Name' alone case-insensitive match - Using 'Hash_Drug_Caps'
df_drug_caps_match = pd.merge(df_old, df_new, on=['Hash_Drug_Caps'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding matched records with 'Drug_Name' and 'Special_Code' match - Using 'Hash_Drug_Code'
df_drug_code_match = pd.merge(df_old, df_new, on=['Hash_Drug_Code'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding matched records with 'Drug_Name' and 'Tier' match - Using 'Hash_Drug_Tier'
df_drug_tier_match = pd.merge(df_old, df_new, on=['Hash_Drug_Tier'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding matched records with case insensitive matching for 'Drug_Name' - Using 'Hash_All_Caps'
df_all_caps_match = pd.merge(df_old, df_new, on=['Hash_All_Caps'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding matched records with all three columns matching - Using 'Hash_All'
df_all_match = pd.merge(df_old, df_new, on=['Hash_All'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]

# finding disjoint (non-matching records)
df_outer_caps_match = pd.merge(df_old, df_new, on=['Hash_Drug_Caps'], how='outer', suffixes=('_left','_right'), indicator=True)

The record counts in old and new dataframes are not huge:

len(df_old.index) #6673
len(df_new.index) #6725

However, this code is far from being efficient and crashing.

Looking for some suggestions on how to achieve what I want in an efficinet and scalable way. Many thanks in advance.

score 0 · Answer 1 · answered Nov 30 '20 at 06:05

0

There may be a better solution, but here's what comes to my mind. First, calculate the inequality between the dataframes:

diff = (df_new != df_old)

Prepare a function that will process each row of the new dataframe. It selects the names of the columns that differ and combines them into a string.

def prepare_message(row):
    row_message = " and ".join(row.index[row])
    return f"{row_message} Changed" if row_message else ""

Finally, apply the function to the difference dataframe and fix the rows that do not see any difference:

flags = diff.apply(prepare_message, axis=1).replace("", "100% Match")
#0                Drug_Name Changed
#1             Special_Code Changed
#2                     Tier Changed
#3                     bold Changed
#4                   italic Changed
#5       Drug_Name and Tier Changed
#6    Special_Code and bold Changed
#7                       100% Match
#8                       100% Match
#9                       100% Match

answered Nov 30 '20 at 06:05

DYZ

55,249
10
64
93

hi @DYZ. Many thanks for a prompt response. When I try this: diff = (df_new.sort_index(inplace=True) != df_old.sort_index(inplace=True)), 'diff' is just a boolean value 'False' not a difference dataframe. – Anshul Nov 30 '20 at 06:18
Also, 'Change_Flag' column is not actually needed, but just dataframe subsets for difference combinations. I added this to the question just to add more clarity to the question. – Anshul Nov 30 '20 at 06:19
Why do you sort on the index? This is wrong. – DYZ Nov 30 '20 at 06:23
If i dont do that, I get this: ValueError: Can only compare identically-labeled DataFrame objects. I referred to this post: https://stackoverflow.com/questions/18548370/pandas-can-only-compare-identically-labeled-dataframe-objects-error – Anshul Nov 30 '20 at 06:24
This means that your example is wrong. My code works on your example and gives the expected answer. You cannot get an answer that you expect if you do not provide adequate data. – DYZ Nov 30 '20 at 06:26
I understand @DYZ. My data-frames are 6.5k records each and has confidential data. But structurally its the same as I provided in the question – Anshul Nov 30 '20 at 06:48
I added more details for clarity. The two dataframes have different number of rows. the new dataframe can have new records as well as deleted some records when compared to old (df_old). – Anshul Nov 30 '20 at 07:09

Identify column wise difference in pandas dataframe

1 Answers1