Comparison of strings not working in pandas dataframe?

Question

I'm building a boolean mask for my DataFrame (imported from a CSV file) based on string comparisons. It seems that .str.contains works, but == doesn't.

This mask using .contains:

mask = (y_train['SEPSISPATOS'].str.contains('Yes')) | (y_train['SEPSHOCKPATOS'].str.contains('Yes')) | (y_train['OTHSYSEP'].str.contains('Sepsis')) | (y_train['OTHSESHOCK'].str.contains('Septic Shock'))

returns this output (note last line):

SEPSISPATOS  SEPSHOCKPATOS  OTHSYSEP            OTHSESHOCK          SEPSISPATOS
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'Septic Shock'     1

while this other mask using direct comparison

mask = (y_train['SEPSISPATOS']=='Yes') | (y_train['SEPSHOCKPATOS']=='Yes') | (y_train['OTHSYSEP']=='Sepsis') | (y_train['OTHSESHOCK']=='Septic Shock')

returns:

SEPSISPATOS  SEPSHOCKPATOS  OTHSYSEP            OTHSESHOCK          SEPSISPATOS
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'No Complication'  0
b'No'        b'No'          b'No Complication'  b'Septic Shock'     0

Suspecting that I have bytes rather than Python 3 Unicode strings, I tried decoding (below) and also tried .str.strip(). Neither worked. I need a fix that lets me use direct == comparisons on any column containing text.

Edit re: utf-8 decoding

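import pandas as pd

NSQIPdf_train = pd.read_csv("acs_nsqip_puf13_2.csv")
# Select the object-dtype (text) columns and try to decode every value as UTF-8
str_df = NSQIPdf_train.select_dtypes([object])
str_df = str_df.stack().str.decode('utf-8').unstack()
# Write the decoded columns back into the original frame
for col in str_df:
    NSQIPdf_train[col] = str_df[col]
y_train = NSQIPdf_train.loc[:, ['SEPSISPATOS', 'SEPSHOCKPATOS', 'OTHSYSEP', 'OTHSESHOCK']]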

This further compounded my problem... as the output became:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK        SEPSISPATOS
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0

`y_train['SEPSHOCKPATOS'].str.==('Yes')` is not valid Python syntax. – IanS Jul 16 '19 at 15:32
Also, `.str` is an accessor, so `y_train['SEPSHOCKPATOS'].str == 'Yes'` is not doing what you think it does (try printing `y_train['SEPSHOCKPATOS'].str`). – IanS Jul 16 '19 at 15:36
Thank you. I've made those changes and updated the code (third block) in the original post to reflect them, but y_train['SEPSISPATOS']=='Yes' doesn't seem to work either. Same output as before. – michellemabelle Jul 16 '19 at 15:45
Was this problem ever solved? I'm having a similar issue where I'm comparing two columns of two different frames with the same index values (i.e. the second frame was made from the first) and the same dtype. The comparisons should return True but return False with the == operator. – Dan Jan 17 '23 at 21:55

score 0 · Answer 1 · answered Jul 16 '19 at 16:02

Use .str.decode('utf-8') to convert your byte values to strings before doing the comparison (see this question):

y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes'

Note: I guess that .str.contains does a conversion under the hood.
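
Another reading of the outputs in the question, offered as an assumption rather than a diagnosis: the cells may already be ordinary str values whose text happens to be b'Septic Shock' (prefix and quotes included). In that case a substring search would match while == would not. A minimal sketch for checking this, using a small made-up frame in place of the real CSV (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for y_train: text that *looks like* bytes literals,
# as one would get if a CSV stored the repr of bytes values.
y_train = pd.DataFrame(
    {"OTHSESHOCK": ["b'No Complication'", "b'Septic Shock'"]},
    dtype=object,
)

# Which Python types are actually stored in the column?
type_names = y_train["OTHSESHOCK"].map(lambda v: type(v).__name__)
print(type_names.value_counts())

# repr() makes the quoting explicit: "b'Septic Shock'" is a str whose first
# two characters are b and ', whereas b'Septic Shock' would be real bytes.
first_value = y_train["OTHSESHOCK"].iloc[1]
print(repr(first_value))

# Substring search finds the text inside the longer value; equality does not.
contains_mask = y_train["OTHSESHOCK"].str.contains("Septic Shock")
equals_mask = y_train["OTHSESHOCK"] == "Septic Shock"
print(contains_mask.tolist(), equals_mask.tolist())
```

If the type count reports str rather than bytes, there is nothing for .str.decode to decode.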

IanS


I tried y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes', which resulted in the same output as before. Based on the link in your suggestion, I also tried decoding the entire df beforehand, which actually caused more problems. Printing y_train['SEPSHOCKPATOS'].str.decode('utf-8') gives me NaNs. – michellemabelle Jul 16 '19 at 16:08
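
The NaNs may indicate that the values are not bytes objects at all. One explanation consistent with them is that the CSV holds the literal text b'...', so each cell is already a str such as "b'No'". Under that assumption only, here is a sketch of stripping the wrapper so that == works. The sample data and the helper name strip_bytes_literal are hypothetical, not taken from the original file:

```python
import pandas as pd

# Hypothetical sample mimicking the output shown in the question.
y_train = pd.DataFrame(
    {
        "SEPSISPATOS": ["b'No'", "b'Yes'", "b'No'"],
        "OTHSESHOCK": ["b'No Complication'", "b'No Complication'", "b'Septic Shock'"],
    },
    dtype=object,
)


def strip_bytes_literal(column: pd.Series) -> pd.Series:
    """Remove a leading b' and a trailing ' from each text value (hypothetical helper)."""
    return column.str.replace(r"^b'(.*)'$", r"\1", regex=True)


# DataFrame.apply passes each column to the helper as a Series.
cleaned = y_train.apply(strip_bytes_literal)

# Direct comparison now sees plain text such as "Yes".
mask = (cleaned["SEPSISPATOS"] == "Yes") | (cleaned["OTHSESHOCK"] == "Septic Shock")
print(mask.tolist())
```

The pattern only handles values wrapped in single quotes. A value written as b"..." would pass through unchanged and need its own rule.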

score 0 · Answer 2 · answered Nov 10 '20 at 17:17

I'm new to pandas, but str.fullmatch may help. It is a stricter version of str.contains that matches the whole string:

y_train['SEPSHOCKPATOS'].str.fullmatch('Yes')

Note that this matches against a regular expression, so take care if the string you're comparing contains any special characters.
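
A short sketch of the special-character caveat, using made-up values (str.fullmatch requires pandas 1.1 or later). re.escape from the standard library is one way to treat the target as literal text:

```python
import re

import pandas as pd

# Hypothetical values; "Shock (septic)" contains regex metacharacters.
labels = pd.Series(["Yes", "Yes, probably", "Shock (septic)"], dtype=object)

# fullmatch requires the whole value to match, unlike contains.
is_yes = labels.str.fullmatch("Yes")

# Unescaped, the parentheses form a regex group, so the literal text does not match.
unescaped = labels.str.fullmatch("Shock (septic)")

# re.escape turns the target into a literal pattern.
escaped = labels.str.fullmatch(re.escape("Shock (septic)"))

print(is_yes.tolist(), unescaped.tolist(), escaped.tolist())
```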
