I have multiple reports I'm putting together. I used regex to extract substring from a unique id based on this nice answer here How to extract specific content in a pandas dataframe with a regex?. Here is my toy example:
df1 = pd.DataFrame([['O-11111', 'fruit'],
['O-11111', 'fruit'],
['O-11111', 'fruit'],
['O-11111', 'fruit'],
['O-99999', 'veggie']
], columns=['id', 'cat'])
df2 = pd.DataFrame([['O-99999', 'veggie'],
['O-99999', 'veggie'],
['O-99999', 'veggie'],
['O-99999', 'veggie'],
['O-99999', 'veggie']
], columns=['id', 'cat'])
df3 = pd.concat([df1,df2])
df3['id1'] = df3.id.str.extract(r'O-([0-9]{5})')
id cat id1
0 O-11111 fruit 11111
1 O-11111 fruit 11111
2 O-11111 fruit 11111
3 O-11111 fruit 11111
4 O-99999 veggie 99999
0 O-99999 veggie 99999
1 O-99999 veggie 99999
2 O-99999 veggie 99999
3 O-99999 veggie 99999
4 O-99999 veggie 99999
Later I needed to add a second column based on a tag match. I use df.loc
together with a nice boolean index:
df3.cat == 'fruit'
0 True
1 True
2 True
3 True
4 False
0 False
1 False
2 False
3 False
4 False
Name: cat, dtype: bool
But when I put the two pieces together, I get unexpected results: id_fruit
has IDs where the tag is veggie
(and they're the wrong ID).
df3['id_fruit'] = df3.loc[df3.cat == 'fruit', 'id'].str.extract(r'O-([0-9]{5})')
id cat id1 id_fruit
0 O-11111 fruit 11111 11111
1 O-11111 fruit 11111 11111
2 O-11111 fruit 11111 11111
3 O-11111 fruit 11111 11111
4 O-99999 veggie 99999 NaN
0 O-99999 veggie 99999 11111
1 O-99999 veggie 99999 11111
2 O-99999 veggie 99999 11111
3 O-99999 veggie 99999 11111
4 O-99999 veggie 99999 NaN
Only after a bunch of debugging I notice the pattern of the error, and think to reset the index, and I get the expected result.
df3.reset_index(drop=True, inplace=True)
df3['id2_fruit'] = df3.loc[df3.cat == 'fruit', 'id'].str.extract(r'O-([0-9]{5})')
id cat id1 id_fruit id2_fruit
0 O-11111 fruit 11111 11111 11111
1 O-11111 fruit 11111 11111 11111
2 O-11111 fruit 11111 11111 11111
3 O-11111 fruit 11111 11111 11111
4 O-99999 veggie 99999 NaN NaN
5 O-99999 veggie 99999 11111 NaN
6 O-99999 veggie 99999 11111 NaN
7 O-99999 veggie 99999 11111 NaN
8 O-99999 veggie 99999 11111 NaN
9 O-99999 veggie 99999 NaN NaN
I'm thinking the boolean index should be the only element making the match.
What don't I understand about using df.loc
that it should behave like this?