I have a table called bom2 (about 74k rows + other columns, but we are interested only in Assembly and Finish product. To clarify things a little the same Assembly and Finish product ID can be present in both columns.
bom2=pd.DataFrame({'Assembly': {0: '0138104116',
1: '0309005994',
2: '0309005996',
3: '0309005998',
4: '0309005998',
5: '0309005998',
6: '0309005998',
7: '0312005996',
8: '0312005997',
9: '0312005998'},
'Finish product': {0: '0138104116',
1: '0309005994',
2: '0309005996',
3: '0309005998',
4: '0309005998',
5: '0309005998',
6: '0309005998',
7: '0312005996',
8: '0312005997',
9: '0312005998'}})
I need to:
take all Assembly IDs
compare with Finish product IDs
take only rows where Finish product doesnt have its ID anywhere in the Assembly column.
Method 1:
nejsou = bom2[-bom2['Finish product'].isin(bom2.Assembly.tolist())]
Method 2:
ass = bom2.Assembly.tolist()
fin = bom2['Finish product'].tolist()
nee=list() #will contain Assembly IDs which do not occur in 'Finish product'
for a in ass:
if a not in fin: nee.append(a)
In next step use IDs in 'nee' to search corresponding rows by
bom2[bom2.Assembly.isin(nee)]
I was expecting these two methods to be equivalent, but when I check the number of Assembly IDs resulting from each method I get different numbers.
print "method(1 + 2):", len(set(nejsou.Assembly.tolist()+nee)), "Unique Assembly IDs"
print "method(1):",len(set(nejsou.Assembly.tolist())), "Unique Assembly IDs"
print "method(2):",len(set(nee)), "Unique Assembly IDs"
method(1 + 2): 4021 Unique Assembly IDs
method(1): 4015 Unique Assembly IDs
method(2): 1986 Unique Assembly IDs
It seems that Method 1 (boolean filtering) captures all cases captured by method 2 + about 2k others and I cant figure out what are the cases.
How are the two methods different from each other? Is there any other way I could achieve desired results?