Using bool index on `df.loc...str.extract()` returns unexpected result

Question

I have multiple reports I'm putting together. I used regex to extract substring from a unique id based on this nice answer here How to extract specific content in a pandas dataframe with a regex?. Here is my toy example:

df1 = pd.DataFrame([['O-11111', 'fruit'], 
                   ['O-11111', 'fruit'],
                   ['O-11111', 'fruit'], 
                   ['O-11111', 'fruit'],
                   ['O-99999', 'veggie']
                  ], columns=['id', 'cat'])
df2 = pd.DataFrame([['O-99999', 'veggie'], 
                   ['O-99999', 'veggie'],
                   ['O-99999', 'veggie'], 
                   ['O-99999', 'veggie'],
                   ['O-99999', 'veggie']
                  ], columns=['id', 'cat'])
df3 = pd.concat([df1,df2])
df3['id1'] = df3.id.str.extract(r'O-([0-9]{5})')

    id      cat     id1
0   O-11111 fruit   11111
1   O-11111 fruit   11111
2   O-11111 fruit   11111
3   O-11111 fruit   11111
4   O-99999 veggie  99999
0   O-99999 veggie  99999
1   O-99999 veggie  99999
2   O-99999 veggie  99999
3   O-99999 veggie  99999
4   O-99999 veggie  99999

Later I needed to add a second column based on a tag match. I use df.loc together with a nice boolean index:

df3.cat == 'fruit'
0     True
1     True
2     True
3     True
4    False
0    False
1    False
2    False
3    False
4    False
Name: cat, dtype: bool

But when I put the two pieces together, I get unexpected results: id_fruit has IDs where the tag is veggie (and they're the wrong ID).

df3['id_fruit'] = df3.loc[df3.cat == 'fruit', 'id'].str.extract(r'O-([0-9]{5})')

    id      cat     id1     id_fruit
0   O-11111 fruit   11111   11111
1   O-11111 fruit   11111   11111
2   O-11111 fruit   11111   11111
3   O-11111 fruit   11111   11111
4   O-99999 veggie  99999   NaN
0   O-99999 veggie  99999   11111
1   O-99999 veggie  99999   11111
2   O-99999 veggie  99999   11111
3   O-99999 veggie  99999   11111
4   O-99999 veggie  99999   NaN

Only after a bunch of debugging I notice the pattern of the error, and think to reset the index, and I get the expected result.

df3.reset_index(drop=True, inplace=True)
df3['id2_fruit'] = df3.loc[df3.cat == 'fruit', 'id'].str.extract(r'O-([0-9]{5})')

    id      cat     id1     id_fruit  id2_fruit
0   O-11111 fruit   11111   11111     11111
1   O-11111 fruit   11111   11111     11111
2   O-11111 fruit   11111   11111     11111
3   O-11111 fruit   11111   11111     11111
4   O-99999 veggie  99999   NaN       NaN
5   O-99999 veggie  99999   11111     NaN
6   O-99999 veggie  99999   11111     NaN
7   O-99999 veggie  99999   11111     NaN
8   O-99999 veggie  99999   11111     NaN
9   O-99999 veggie  99999   NaN       NaN

I'm thinking the boolean index should be the only element making the match. What don't I understand about using df.loc that it should behave like this?

Use `df3 = pd.concat([df1,df2]).reset_index(drop=True)` In the case of your dataframe and using `.loc`, there are multiple occurrences of the index having the same name. — Trenton McKinney, Aug 01 '20 at 22:56
Trenton is right, also what you are doing is a) creating a series with a filtered subset of df3 and exctract b) using that series create a column in df3, pandas will align indexes before assigning values (i.e. indexes in df3 that are not present in the new series will have NaN) ... so instead of forcing pandas to look out for these idxs with NaN, maybe force the assignment to only target rows with `df3[df3.cat == 'fruit', 'id2_fruit'] = df3.loc[df3.cat == 'fruit', 'id'].str.extract(r'O-([0-9]{5})')` — RichieV, Aug 02 '20 at 02:21
"there are multiple occurrences of the index having the same name". Yes. Why does it matter if you're using a boolean index as the filter? Isn't this counter intuitive to how boolean indexes make selections from DF. Which leads me to think this behavior is caused by `df.loc`. — xtian, Aug 02 '20 at 16:58
_"instead of forcing pandas to look out for these idxs with NaN"_ I'm not clear on this comment. NaN in `id2_fruit` is the correct response. It's a new column, and the result of not matching the `cat == 'fruit'` constraint. That said, I'm intrigued by your method to duplicate the matching constraint on both sides of the assignment operator. — xtian, Aug 02 '20 at 17:04

Using bool index on `df.loc...str.extract()` returns unexpected result

0 Answers0

Linked