2

I am trying to iterate over three data frames to find the difference between them. I have a master data frame which contains everything and two other data frames which contains partial of master data frame. I am trying to write a python code to identify what is missing in the other two files. Master file looks like following:

ID  Name
1   Mike
2   Dani
3   Scott
4   Josh
5   Nate
6   Sandy

second data frame looks like following:

ID  Name
1   Mike
2   Dani
3   Scott
6   Sandy

Third data frame looks like following:

ID  Name
1   Mike
2   Dani
3   Scott
4   Josh
5   Nate

So there will be two output data frame. Desired output for looks like following for second data frame:

ID  Name
4   Josh
5   Nate

desired output for third data frame looks like following:

ID  Name
6   Sandy

I didn't find anything similar on Google. I tried this:

for i in second['ID'], third['ID']:
   if i not in master['ID']:
     print(i)

It returns all the data in master file.

Also if I try this code :

import pandas as pd

names = ["Mike", "Dani", "Scott", "Josh", "Nate", "Sandy"]
ids = [1, 2, 3, 4, 5, 6]
master = pd.DataFrame({"ID": ids, "Name": names})
# print(master)

names_second = ["Mike", "Dani", "Scott", "Sandy"]
ids_second = [1, 2, 3, 6]
second = pd.DataFrame({"ID": ids_second, "Name": names_second})
# print(second)

names_third = ["Mike", "Dani", "Scott", "Josh", "Nate"]
ids_third = [1, 2, 3, 4, 5]
third = pd.DataFrame({"ID": ids_third, "Name": names_third})
# print(third)
for i in master['ID']:
    if i not in second["ID"]:
        print("NOT IN SECOND", i)
    if i not in third["ID"]:
        print("NOT IN THIRD", i)

OUTPUT ::

NOT IN SECOND 4
NOT IN SECOND 5
NOT IN THIRD 5
NOT IN SECOND 6
NOT IN THIRD 6

Why it says NOT IN SECOND 6 and NOT IN THIRD 5?

Any suggestion? Thanks in advance.

Shashank Singh
  • 647
  • 1
  • 5
  • 22
Bonjuga Lewis
  • 103
  • 1
  • 6

2 Answers2

3

You can try using .isin with ~ to filter dataframes. To compare with second you can use master[~master.ID.isin(second.ID)] and similar for third:

cmp_master_second, cmp_master_third = master[~master.ID.isin(second.ID)],  master[~master.ID.isin(third.ID)]

print(cmp_master_second)
print('\n-------- Seperate dataframes -----------\n')
print(cmp_master_third)

Result:

    Name
ID      
4   Josh
5   Nate

-------- Seperate dataframes -----------

     Name
ID       
6   Sandy
niraj
  • 17,498
  • 4
  • 33
  • 48
  • When I tried your code I got `5 Nate` in the second dataframe. I have added the code in the main question about how I created my `master, second and third dataframes`. – Shashank Singh Jun 21 '18 at 03:39
  • @ShashankSingh Try now, I missed to add `ID` in third and instead added `index`. – niraj Jun 21 '18 at 03:43
0

You could do a set difference on the master and the other DataFrames

In [315]: set(d1[0]) - set(d2[0])
Out[315]: {'Josh', 'Nate'}

In [316]: set(d1[0]) - set(d3[0])
Out[316]: {'Sandy'}
aydow
  • 3,673
  • 2
  • 23
  • 40