0

Background: I have 2 files, with few matching columns and there is a common column between them for comparison.

Example:

Table 1 , about 10K rows
| col1 | col2 | col3 |
|------|------|------|
| adam | key1 | def  |
| mike | key2 | efg  |
Table 2 , about 5k rows
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| adam | key1 | def  | abc  |
| mike | key2 |      | cdf  |

Now I am trying to get data from first file, compare it with the second one and if it is different, create a new file to get the complete row from first file. So in above example, taking the first column as comparison column, for adam, col1, 2, 3 are same so no update record required. for mike, col2 value will is same, but the col3 has different value so it will be part of the update file.

I am using python and this currently is too basic and I cannot seem to find a more efficient way for it


df_1 = pd.read_excel("file1.xlsx")
df_2 = pd.read_excel("file2.xlsx")

df_1 = df_1.fillna('')
df_2 = df_2.fillna('')

df_3 = pd.DataFrame()

for i in range(0, len(df_1.index)):
    for j in range(0,len(df_2.index)):
        if(df_1[col1].iloc[i] == df_2[col1].iloc[j]):
            flag = 0
            if(df_1[col2].iloc[i] != df_2[col2].iloc[j]): flag = 1
            if(df_1[col3].iloc[i] != df_2[col3].iloc[j]): flag = 1
            if(flag): df_3 = df_3.append(df_1.iloc[[i]])
            break;

writer = ExcelWriter("update.xlsx")
df_3.to_excel(writer,index=False,header=True)

I tried few variations like below, was trying to use something along lines of match on common row using the col1 value and then check if other two column values are same of different and return that, but it does not return all required data

df_3 = df_1[(
            df_1['col1'].isin(df_2['col1']) & 
                (~df_1['col2'].isin(df_2['col2']) 
                | ~df_1['col3'].isin(df_2['col3']) 
                ) 
            )]

Output:

   col1  col2 col3
0  mike  key2  efg   

1 Answers1

0

IIUC, use pandas.DataFrame.update:

Given:

# df1
   col1  col2 col3
0  adam  key1  def
1  mike  key2  efg
# df2
   col1  col2 col3 col4
0  adam  key1  def  abc
1  mike  key2       cdf


new_df = df1.set_index('col1')
new_df.update(df2.set_index('col1'))
new_df.reset_index(inplace=True)
print(new_df)

Output:

   col1  col2 col3
0  adam  key1  def
1  mike  key2     

To get the updated row(s) (in their original state):

df1[(new_df != df1).any(1)]

Output:

   col1  col2 col3
1  mike  key2  efg
Chris
  • 29,127
  • 3
  • 28
  • 51
  • Just edited the post, yup only the line with mike as output and ignore the adam line – user5338274 Sep 20 '19 at 10:39
  • 2 issues first is, getting "ValueError: Can only compare identically-labeled DataFrame objects", i tried https://stackoverflow.com/questions/18548370/pandas-can-only-compare-identically-labeled-dataframe-objects-error, does not seem to help. second is, new_df.update(df2.set_index('col1')) did a quick read of its documentation, it updates as per df2 into new_df , so output will be "mike, key2, " – user5338274 Sep 20 '19 at 11:05