Pandas boolean indexing vs comparision of two lists: Same (?) methods, different results

Question

I have a table called bom2 (about 74k rows + other columns, but we are interested only in Assembly and Finish product. To clarify things a little the same Assembly and Finish product ID can be present in both columns.

  bom2=pd.DataFrame({'Assembly': {0: '0138104116',
  1: '0309005994',
  2: '0309005996',
  3: '0309005998',
  4: '0309005998',
  5: '0309005998',
  6: '0309005998',
  7: '0312005996',
  8: '0312005997',
  9: '0312005998'},
 'Finish product': {0: '0138104116',
  1: '0309005994',
  2: '0309005996',
  3: '0309005998',
  4: '0309005998',
  5: '0309005998',
  6: '0309005998',
  7: '0312005996',
  8: '0312005997',
  9: '0312005998'}})

I need to:

take all Assembly IDs

compare with Finish product IDs

take only rows where Finish product doesnt have its ID anywhere in the Assembly column.

Method 1:

nejsou = bom2[-bom2['Finish product'].isin(bom2.Assembly.tolist())]

Method 2:

ass = bom2.Assembly.tolist()
fin = bom2['Finish product'].tolist()
nee=list()  #will contain Assembly IDs which do not occur in 'Finish product'
for a in ass:
    if a not in fin: nee.append(a)

In next step use IDs in 'nee' to search corresponding rows by

bom2[bom2.Assembly.isin(nee)]

I was expecting these two methods to be equivalent, but when I check the number of Assembly IDs resulting from each method I get different numbers.

print "method(1 + 2):", len(set(nejsou.Assembly.tolist()+nee)), "Unique Assembly IDs"
print "method(1):",len(set(nejsou.Assembly.tolist())), "Unique Assembly IDs"
print "method(2):",len(set(nee)), "Unique Assembly IDs"

method(1 + 2): 4021 Unique Assembly IDs
method(1): 4015 Unique Assembly IDs
method(2): 1986 Unique Assembly IDs

It seems that Method 1 (boolean filtering) captures all cases captured by method 2 + about 2k others and I cant figure out what are the cases.

How are the two methods different from each other? Is there any other way I could achieve desired results?

In `bom2[-...]`, is the minus sign intentional? If so, what did you mean by it? — Ami Tavory, Jun 28 '15 at 13:10
Yea, minus sign or alternatively ~ acts as logical not (Negation) so -True == False and -False == True as in http://stackoverflow.com/questions/14057007/remove-rows-not-isinx — redacted, Jun 28 '15 at 14:02
Interesting. I always used `~`, and didn't know `-` does the same thing. Thanks. — Ami Tavory, Jun 28 '15 at 15:08

unutbu · Accepted Answer · 2015-06-28T13:50:58.387

Consider the following toy example:

import numpy as np
import pandas as pd

bom2 = pd.DataFrame(
    [
        ['A', 'F'],
        ['AF', 'F'],

        ['AF2', 'F'],   
        ['A', 'AF2'], 

        ['A', 'AF'],
        ['A2', 'AF'],

    ], columns=['ASS', 'FIN'])

print(bom2)
#    ASS  FIN
# 0    A    F
# 1   AF    F
# 2  AF2    F
# 3    A  AF2
# 4    A   AF
# 5   A2   AF

The "pattern" is:

AF and AF2 refer to IDS which are in both ASS and FIN.
A and A2 are only in ASS.
F is only in FIN.

Then:

ass = bom2['ASS'].tolist()
fin = bom2['FIN'].tolist()

# filter out rows of bom2 where FIN equals AF
nejsou = bom2[-bom2['FIN'].isin(ass)] 

nee = list()  #will contain ASS not in FIN
for a in ass:
    if a not in fin: nee.append(a)

print(nejsou)
#    ASS FIN
# 0    A   F
# 1   AF   F
# 2  AF2   F

yields

print "method(1 + 2):", len(set(nejsou.ASS.tolist()+nee)), "Unique ASS IDs"
# method(1 + 2): 4 Unique ASS IDs

print "method(1):",len(set(nejsou.ASS.tolist())), "Unique ASS IDs"
# method(1): 3 Unique ASS IDs

print "method(2):",len(set(nee)), "Unique ASS IDs"
# method(2): 2 Unique ASS IDs

Notice that

print(set(nee))
# set(['A', 'A2'])

print(set(nejsou.ASS.tolist()))
# set(['A', 'AF2', 'AF'])

So method(1) and method(2) produce different results. If you follow the pattern, you can add as many items as you wish to either set.

To increase method(1) without increasing method(2), add more rows like AF
```
['AFn', 'F'],
['A', 'AFn'],
```
To increase method(2) without increasing method(1), add more rows like A2
```
['An', 'AF'], 
```

To take only rows where Finish product doesn't have its ID anywhere in the Assembly column use method 1:

In [107]: bom2.loc[~bom2['FIN'].isin(bom2['ASS'])]
Out[107]: 
   ASS FIN
0    A   F
1   AF   F
2  AF2   F

not method 2:

In [109]: bom2[bom2['ASS'].isin(nee)]
Out[109]: 
  ASS FIN
0   A   F
4   A  AF  <-- WRONG; AF is in bom2['ASS']
5  A2  AF  <-- WRONG

Geeocode · Answer 2 · 2015-06-28T16:30:31.850

This is not about the method1 and method2 and thus not about the programming technics rather about logic. What is more I have an idea that you will get the same results if you use boolean indexing/filtering at the second method and vica versa. It is because, if we merge too list with some unique values in each, there will be identical values in each list which will turn into one value instead of two after the set() function.

Consider the following example:

a = [1,2,3]
b = [3,4,5]

print set(a)
print set(b)
print
print len(set(a))
print len(set(b))
print
print set(a+b)
print
print len(set(a+b))

Output:

set([1, 2, 3])
set([3, 4, 5])

3
3

set([1, 2, 3, 4, 5])

5

Glad to have been of help! Feel free to accept my answer if you feel it was useful to you. :-)

Pandas boolean indexing vs comparision of two lists: Same (?) methods, different results

2 Answers2