0

I was comparing two excel files which contains the information of the students of two schools. However those files might contain different number of rows between them.

The first set that I used is to import the excel files in two dataframes:

df1 = pd.read_excel('School A - Information.xlsx')
df2 = pd.read_excel('School B - Information.xlsx')

print(df1)

       Name  Age Birth_Country  Previous Schools
0   tom   10           USA                 3
1  nick   15           MEX                 1
2  juli   14           CAN                 0
3   tom   19           NOR                 1

print(df2)

   Name  Age Birth_Country  Previous Schools
0   tom   10           USA                 3
1   tom   19           NOR                 1
2  nick   15           MEX                 4

After this, I would like to check the divergences between those two dataframes (index order is not important). However I am receiving an error due to the size of the dataframes.

compare = df1.values == df2.values

<ipython-input-9-7cc64ba0e622>:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  compare = df1.values == df2.values

print(compare)
False

Adding to that, I would like to create a third DataFrame with the corresponding divergences, that shows the divergence.

import numpy as np
rows,cols=np.where(compare==False)

for item in zip(rows,cols):
    df1.iloc[item[0], item[1]] = '{} --> {}'.format(df1.iloc[item[0], item[1]],df2.iloc[item[0], item[1]])

However, using this code is not working, as the index order may be different between the two dataframes.

My expected output should be the below dataframe:

enter image description here

danielssl
  • 29
  • 4
  • One way to do that would be to choose a few rows to serve as the keys that will be used for comparison, so, for example, the row pertaining to `nick` on the first dataframe would be compared to the `nick` on the second one. But, look at `tom`: there are two students named `tom` on the second dataframe, so which would correspond to the `tom` that is present on the first one? You could use a composite key using both `name` and `age`, which would break the tie on the example above, but what would happen if, hipotethically, there were two students named `bob` with the same age in the same class? – Haroldo_OK Oct 25 '21 at 17:32

1 Answers1

0

You can use pd.merge to accomplish this. If you're unfamiliar with dataframe merges, here's a post that describes relational database merging ideas: link. So in this case, what we want to do is first do a left merge of df2 onto df1 to find how the Previous Schools column differs:

df_merged = pd.merge(df1, df2, how="left", on=["Name", "Age", "Birth_Country"], suffixes=["_A", "_B"])
print(df_merged)

will give you a new dataframe

   Name  Age Birth_Country  Previous Schools_A  Previous Schools_B
0   tom   10           USA                   3                 3.0
1  nick   15           MEX                   1                 4.0
2  juli   14           CAN                   0                 NaN
3   tom   19           NOR                   1                 1.0

This new dataframe has all the information you're looking for. To find just the rows where the Previous Schools entries differ:

df_different = df_merged[df_merged["Previous Schools_A"]!=df_merged["Previous Schools_B"]]
print(df_different)
   Name  Age Birth_Country  Previous Schools_A  Previous Schools_B
1  nick   15           MEX                   1                 4.0
2  juli   14           CAN                   0                 NaN

and to find the rows where Previous Schools has not changed:

df_unchanged = df_merged[df_merged["Previous Schools_A"]==df_merged["Previous Schools_B"]]
print(df_unchanged)
  Name  Age Birth_Country  Previous Schools_A  Previous Schools_B
0  tom   10           USA                   3                 3.0
3  tom   19           NOR                   1                 1.0

If I were you, I'd stop here, because creating the final dataframe you want is going to have generic object column types because of the mix of strings and integers, which will limit its uses... but maybe you need it in that particular formattting for some reason. In which case, it's all about putting together these dataframe subsets in the right way to get your desired formatting. Here's one way.

First, initialize the final dataframe with the unchanged rows:

df_final = df_unchanged[["Name", "Age", "Birth_Country", "Previous Schools_A"]].copy()
df_final = df_final.rename(columns={"Previous Schools_A": "Previous Schools"})
print(df_final)
  Name  Age Birth_Country  Previous Schools
0  tom   10           USA                 3
3  tom   19           NOR                 1

now process the entries that have changed between dataframes. There are two cases here: where the entries have changed (where Previous Schools_B is not NaN) and where the entrie is new (where Previous Schools_B is NaN). We'll deal with each in turn:

changed_entries = df_different[~pd.isnull(df_different["Previous Schools_B"])].copy()
changed_entries["Previous Schools"] = changed_entries["Previous Schools_A"].astype('str') + " --> " + changed_entries["Previous Schools_B"].astype('int').astype('str')
changed_entries = changed_entries.drop(columns=["Previous Schools_A", "Previous Schools_B"])
print(changed_entries)
   Name  Age Birth_Country Previous Schools
1  nick   15           MEX          1 --> 4

and now process the entries that are completely new:

new_entries = df_different[pd.isnull(df_different["Previous Schools_B"])].copy()
new_entries = "NaN --> " + new_entries[["Name", "Age", "Birth_Country", "Previous Schools_A"]].astype('str')
new_entries = new_entries.rename(columns={"Previous Schools_A": "Previous Schools"})
print(new_entries)
           Name         Age Birth_Country Previous Schools
2  NaN --> juli  NaN --> 14   NaN --> CAN        NaN --> 0

and finally, concatenate all the dataframes:

df_final = pd.concat([df_final, changed_entries, new_entries])
print(df_final)
           Name         Age Birth_Country Previous Schools
0           tom          10           USA                3
3           tom          19           NOR                1
1          nick          15           MEX          1 --> 4
2  NaN --> juli  NaN --> 14   NaN --> CAN        NaN --> 0
chris
  • 1,267
  • 7
  • 20