Background: I have 2 files, with few matching columns and there is a common column between them for comparison.
Example:
Table 1 , about 10K rows
| col1 | col2 | col3 |
|------|------|------|
| adam | key1 | def |
| mike | key2 | efg |
Table 2 , about 5k rows
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| adam | key1 | def | abc |
| mike | key2 | | cdf |
Now I am trying to get data from first file, compare it with the second one and if it is different, create a new file to get the complete row from first file. So in above example, taking the first column as comparison column, for adam, col1, 2, 3 are same so no update record required. for mike, col2 value will is same, but the col3 has different value so it will be part of the update file.
I am using python and this currently is too basic and I cannot seem to find a more efficient way for it
df_1 = pd.read_excel("file1.xlsx")
df_2 = pd.read_excel("file2.xlsx")
df_1 = df_1.fillna('')
df_2 = df_2.fillna('')
df_3 = pd.DataFrame()
for i in range(0, len(df_1.index)):
for j in range(0,len(df_2.index)):
if(df_1[col1].iloc[i] == df_2[col1].iloc[j]):
flag = 0
if(df_1[col2].iloc[i] != df_2[col2].iloc[j]): flag = 1
if(df_1[col3].iloc[i] != df_2[col3].iloc[j]): flag = 1
if(flag): df_3 = df_3.append(df_1.iloc[[i]])
break;
writer = ExcelWriter("update.xlsx")
df_3.to_excel(writer,index=False,header=True)
I tried few variations like below, was trying to use something along lines of match on common row using the col1 value and then check if other two column values are same of different and return that, but it does not return all required data
df_3 = df_1[(
df_1['col1'].isin(df_2['col1']) &
(~df_1['col2'].isin(df_2['col2'])
| ~df_1['col3'].isin(df_2['col3'])
)
)]
Output:
col1 col2 col3
0 mike key2 efg