I have two sets of pandas dataframes:
df_old:
Drug_Name Special_Code Tier bold italic
0 abc None T9 FALSE TRUE
1 por None T9 TRUE FALSE
2 ASD None T9 FALSE TRUE
3 bhj None T9 TRUE FALSE
4 JLL None T9 FALSE TRUE
5 hhj None T2 TRUE FALSE
6 YUI None T1 FALSE TRUE
7 jjj None T1 FALSE TRUE
8 IOP None T2 FALSE TRUE
9 lkh None T9 FALSE TRUE
10 jjs None T7 FALSE TRUE
11 LHH R T6 TRUE FALSE
12 kjk None T7 TRUE FALSE
df_new:
Drug_Name Special_Code Tier bold italic
0 ABC None T9 FALSE TRUE
1 por Z T9 TRUE FALSE
2 ASD None T8 FALSE TRUE
3 bhj None T9 FALSE FALSE
4 JLL None T9 FALSE FALSE
5 HHJ None T1 TRUE FALSE
6 YUI A T1 TRUE TRUE
7 jjj None T1 FALSE TRUE
8 IOP None T2 FALSE TRUE
9 lkh None T9 FALSE TRUE
10 YUH I T3 TRUE FALSE
11 jjk None T8 TRUE TRUE
12 GGJ None T7 FALSE TRUE
13 kkl R T6 TRUE FALSE
14 gvb None T7 TRUE FALSE
I am looking for getting the information similar to that shown in 'CHANGE_FLAG' in the sample Output below:
df_old df_new
Drug_Name Special_Code Tier bold italic Drug_Name Special_Code Tier bold italic CHANGE_FLAG
0 abc None T9 FALSE TRUE 0 ABC None T9 FALSE TRUE Drug_Name Changed
1 por None T9 TRUE FALSE 1 por Z T9 TRUE FALSE Special_Code Changed
2 ASD None T9 FALSE TRUE 2 ASD None T8 FALSE TRUE Tier Changed
3 bhj None T9 TRUE FALSE 3 bhj None T9 FALSE FALSE bold Changed
4 JLL None T9 FALSE TRUE 4 JLL None T9 FALSE FALSE italic Changed
5 hhj None T2 TRUE FALSE 5 HHJ None T1 TRUE FALSE Drug_Name and Tier Changed
6 YUI None T1 FALSE TRUE 6 YUI A T1 TRUE TRUE Special_Code and bold Changed
7 jjj None T1 FALSE TRUE 7 jjj None T1 FALSE TRUE 100% Match
8 IOP None T2 FALSE TRUE 8 IOP None T2 FALSE TRUE 100% Match
9 lkh None T9 FALSE TRUE 9 lkh None T9 FALSE TRUE 100% Match
10 jjs None T7 FALSE TRUE No Match
11 LHH R T6 TRUE FALSE No Match
12 kjk None T7 TRUE FALSE No Match
10 YUH I T3 TRUE FALSE No Match
11 jjk None T8 TRUE TRUE No Match
12 GGJ None T7 FALSE TRUE No Match
13 kkl R T6 TRUE FALSE No Match
14 gvb None T7 TRUE FALSE No Match
I am looking for getting the information in the 'CHANGE_FLAG'. The issue is that we want to identify which particular column has changed and also if there are a combination of columns that have changed, those need to be flagged separately. "One key rule is that drug name can only change in case (small case to caps or vice versa)". I thought of creating a combination of hash values for different columns and then using those to perform merge. Something like this:
# convert 'bold' and 'italic' fields to numeric
df_old['bold'] = [1 if val else 0 for val in df_old['bold']]
df_old['italic'] = [1 if val else 0 for val in df_old['italic']]
df_new['bold'] = [1 if val else 0 for val in df_new['bold']]
df_new['italic'] = [1 if val else 0 for val in df_new['italic']]
# hash with 'Drug_Name' alone
df_old['Hash_Drug'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name']]
df_new['Hash_Drug'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name']]
# hash with case-insensitive (uppercase) 'Drug_Name' alone
df_old['Hash_Drug_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'].str.upper()]
df_new['Hash_Drug_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'].str.upper()]
# hash with 'Drug_Name' and 'Special_Cdoe' combined # .astype(str)
df_old['Hash_Drug_Code'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code']]
df_new['Hash_Drug_Code'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code']]
# hash with 'Drug_Name' and 'Tier' combined
df_old['Hash_Drug_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Tier']]
df_new['Hash_Drug_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Tier']]
# hash with 'Drug_Name', 'Special_Cdoe', 'Tier' combined
df_old['Hash_Drug_Code_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier']]
df_new['Hash_Drug_Code_Tier'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier']]
# hash with all columns combined except 'italic'
df_old['Hash_Not_Italic'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str)]
df_new['Hash_Not_Italic'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str)]
# hash with all columns combined except 'bold'
df_old['Hash_Not_Bold'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['italic'].astype(str)]
df_new['Hash_Not_Bold'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['italic'].astype(str)]
# hash with all columns combined
df_old['Hash_All'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'] + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str) + '_' + df_old['italic'].astype(str)]
df_new['Hash_All'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'] + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str) + '_' + df_new['italic'].astype(str)]
# same as 'Hash_All' but with 'Drug_Name' caps
df_old['Hash_All_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_old['Drug_Name'].str.upper() + '_' + df_old['Special_Code'] + '_' + df_old['Tier'] + '_' + df_old['bold'].astype(str) + '_' + df_old['italic'].astype(str)]
df_new['Hash_All_Caps'] = [hashlib.md5(str(value).strip().encode('utf-8')).hexdigest() for value in df_new['Drug_Name'].str.upper() + '_' + df_new['Special_Code'] + '_' + df_new['Tier'] + '_' + df_new['bold'].astype(str) + '_' + df_new['italic'].astype(str)]
And then generate separate dataframes based on hash values:
# finding matched records with 'Drug_Name' alone match - Using 'Hash_Drug'
df_drug_match = pd.merge(df_old, df_new, on=['Hash_Drug'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding matched records with 'Drug_Name' alone case-insensitive match - Using 'Hash_Drug_Caps'
df_drug_caps_match = pd.merge(df_old, df_new, on=['Hash_Drug_Caps'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding matched records with 'Drug_Name' and 'Special_Code' match - Using 'Hash_Drug_Code'
df_drug_code_match = pd.merge(df_old, df_new, on=['Hash_Drug_Code'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding matched records with 'Drug_Name' and 'Tier' match - Using 'Hash_Drug_Tier'
df_drug_tier_match = pd.merge(df_old, df_new, on=['Hash_Drug_Tier'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding matched records with case insensitive matching for 'Drug_Name' - Using 'Hash_All_Caps'
df_all_caps_match = pd.merge(df_old, df_new, on=['Hash_All_Caps'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding matched records with all three columns matching - Using 'Hash_All'
df_all_match = pd.merge(df_old, df_new, on=['Hash_All'], how='inner', suffixes=('_left','_right'), indicator=True)[['index_left', 'index_right', 'Drug_Name_left', 'Drug_Name_right', 'Special_Code_left', 'Special_Code_right', 'Tier_left', 'Tier_right', 'bold_right', 'bold_left', 'italic_right', 'italic_left']]
# finding disjoint (non-matching records)
df_outer_caps_match = pd.merge(df_old, df_new, on=['Hash_Drug_Caps'], how='outer', suffixes=('_left','_right'), indicator=True)
The record counts in old and new dataframes are not huge:
len(df_old.index) #6673
len(df_new.index) #6725
However, this code is far from being efficient and crashing.
Looking for some suggestions on how to achieve what I want in an efficinet and scalable way. Many thanks in advance.