Trying to compare 2 excel files via python and pandas

Question

Background: I have 2 files, with few matching columns and there is a common column between them for comparison.

Example:

Table 1 , about 10K rows
| col1 | col2 | col3 |
|------|------|------|
| adam | key1 | def  |
| mike | key2 | efg  |

Table 2 , about 5k rows
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| adam | key1 | def  | abc  |
| mike | key2 |      | cdf  |

Now I am trying to get data from first file, compare it with the second one and if it is different, create a new file to get the complete row from first file. So in above example, taking the first column as comparison column, for adam, col1, 2, 3 are same so no update record required. for mike, col2 value will is same, but the col3 has different value so it will be part of the update file.

I am using python and this currently is too basic and I cannot seem to find a more efficient way for it


df_1 = pd.read_excel("file1.xlsx")
df_2 = pd.read_excel("file2.xlsx")

df_1 = df_1.fillna('')
df_2 = df_2.fillna('')

df_3 = pd.DataFrame()

for i in range(0, len(df_1.index)):
    for j in range(0,len(df_2.index)):
        if(df_1[col1].iloc[i] == df_2[col1].iloc[j]):
            flag = 0
            if(df_1[col2].iloc[i] != df_2[col2].iloc[j]): flag = 1
            if(df_1[col3].iloc[i] != df_2[col3].iloc[j]): flag = 1
            if(flag): df_3 = df_3.append(df_1.iloc[[i]])
            break;

writer = ExcelWriter("update.xlsx")
df_3.to_excel(writer,index=False,header=True)

I tried few variations like below, was trying to use something along lines of match on common row using the col1 value and then check if other two column values are same of different and return that, but it does not return all required data

df_3 = df_1[(
            df_1['col1'].isin(df_2['col1']) & 
                (~df_1['col2'].isin(df_2['col2']) 
                | ~df_1['col3'].isin(df_2['col3']) 
                ) 
            )]

Output:

   col1  col2 col3
0  mike  key2  efg

How did mike get `def`? Do you mean `efg`? – Chris Sep 20 '19 at 10:39 — Chris, Sep 20 '19 at 10:39
fat fingers..yes the data from first frame as is – user5338274 Sep 20 '19 at 10:41 — user5338274, Sep 20 '19 at 10:41

Chris · Accepted Answer · 2019-09-20T10:41:20.470

0

IIUC, use pandas.DataFrame.update:

Given:

# df1
   col1  col2 col3
0  adam  key1  def
1  mike  key2  efg
# df2
   col1  col2 col3 col4
0  adam  key1  def  abc
1  mike  key2       cdf


new_df = df1.set_index('col1')
new_df.update(df2.set_index('col1'))
new_df.reset_index(inplace=True)
print(new_df)

Output:

   col1  col2 col3
0  adam  key1  def
1  mike  key2

To get the updated row(s) (in their original state):

df1[(new_df != df1).any(1)]

Output:

   col1  col2 col3
1  mike  key2  efg

edited Sep 20 '19 at 10:41

answered Sep 20 '19 at 10:30

Chris

29,127
3
28
51

Just edited the post, yup only the line with mike as output and ignore the adam line – user5338274 Sep 20 '19 at 10:39
2 issues first is, getting "ValueError: Can only compare identically-labeled DataFrame objects", i tried https://stackoverflow.com/questions/18548370/pandas-can-only-compare-identically-labeled-dataframe-objects-error, does not seem to help. second is, new_df.update(df2.set_index('col1')) did a quick read of its documentation, it updates as per df2 into new_df , so output will be "mike, key2, " – user5338274 Sep 20 '19 at 11:05

Trying to compare 2 excel files via python and pandas

1 Answers1