
I'm building a mask for my DataFrame (imported from a CSV file) based on string comparisons, but .str.contains works while == doesn't.

This mask using .contains:

mask = (y_train['SEPSISPATOS'].str.contains('Yes')) | (y_train['SEPSHOCKPATOS'].str.contains('Yes')) | (y_train['OTHSYSEP'].str.contains('Sepsis')) | (y_train['OTHSESHOCK'].str.contains('Septic Shock'))

returns this output (note last line):

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      1            

while this other mask using direct comparison

mask = (y_train['SEPSISPATOS']=='Yes') | (y_train['SEPSHOCKPATOS']=='Yes') | (y_train['OTHSYSEP']=='Sepsis') | (y_train['OTHSESHOCK']=='Septic Shock')

returns:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK           SEPSISPATOS
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'No Complication'   0
b'No'         b'No'  b'No Complication'  b'Septic Shock'      0            

Wondering whether I have byte strings rather than Python 3 Unicode strings, I tried decoding (below). I also tried .str.strip(). Neither worked. I need a fix that lets me use direct comparisons between strings for any column containing text.

Edit re: utf-8 decoding

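import pandas as pd

NSQIPdf_train = pd.read_csv("acs_nsqip_puf13_2.csv")
# Select the text (object-dtype) columns; np.object was removed in NumPy 1.24, so use object
str_df = NSQIPdf_train.select_dtypes([object])
# Attempt to decode every value in those columns from bytes to str
str_df = str_df.stack().str.decode('utf-8').unstack()
for col in str_df:
    NSQIPdf_train[col] = str_df[col]
y_train = NSQIPdf_train.loc[:, ['SEPSISPATOS', 'SEPSHOCKPATOS', 'OTHSYSEP', 'OTHSESHOCK']]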

This further compounded my problem... as the output became:

SEPSISPATOS   SEPSHOCKPATOS   OTHSYSEP   OTHSESHOCK        SEPSISPATOS
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0
NaN            NaN            NaN        NaN               0          
  • `y_train['SEPSHOCKPATOS'].str.==('Yes')` is not valid Python syntax – IanS Jul 16 '19 at 15:32
  • besides, the parenthesis around `'Yes'` are a distraction – IanS Jul 16 '19 at 15:35
  • finally, `.str` is an accessor, so `y_train['SEPSHOCKPATOS'].str == 'Yes'` is not doing what you think it does (try printing `y_train['SEPSHOCKPATOS'].str`) – IanS Jul 16 '19 at 15:36
  • thank you, i've made those changes and updated the code (third block) in the original post to reflect it but y_train['SEPSISPATOS']=='Yes' doesn't seem to work either. same output as before. – michellemabelle Jul 16 '19 at 15:45
  • `.str.decode('utf-9')` should be utf-8 – IanS Jul 17 '19 at 05:48
  • Was this problem ever solved? I'm having a similar issue where I'm comparing two columns of two different frames, same index values (ie second frame made from the first), same dtype, and comparisons should return true but are returning false with an == operator. – Dan Jan 17 '23 at 21:55

2 Answers


Use .str.decode('utf-8') to convert your byte values to strings before doing the comparison (see this question):

y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes'

Note: I guess that .str.contains does a conversion under the hood.
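One possible explanation for the NaNs, offered as an assumption rather than something this thread confirms: if the CSV cells literally contain the text b'Yes' (the printed form of a bytes object that was saved as text), the values are already ordinary str objects. In that case .str.decode has nothing to decode and returns NaN, == 'Yes' never matches, and .str.contains('Yes') still finds the substring. A minimal sketch with made-up data (the Series below is hypothetical, not taken from the NSQIP file):

```python
import pandas as pd

# Hypothetical column: plain str values whose text merely looks like a bytes literal
col = pd.Series(["b'No'", "b'Yes'", "b'No'"], dtype=object, name="SEPSHOCKPATOS")

# Check what the elements really are before choosing a fix
element_types = col.map(type).unique()

# str elements have no .decode method, so pandas fills the result with NaN
decoded = col.str.decode("utf-8")

# Equality compares the whole value "b'Yes'" against "Yes", so nothing matches
direct = col == "Yes"

# Substring search still succeeds, which would explain why .str.contains "works"
contains = col.str.contains("Yes")

# Assumed fix: strip the literal b'...' wrapper, then compare directly
cleaned = col.str.replace(r"^b'(.*)'$", r"\1", regex=True)
mask = cleaned == "Yes"

print(element_types)
print(mask.tolist())
```

If col.map(type) reports bytes instead of str, the .str.decode('utf-8') comparison above should apply as written.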

IanS
  • I tried both y_train['SEPSHOCKPATOS'].str.decode('utf-8') == 'Yes', which resulted in the same output as before. Based on the link in your suggestion, I also tried decoding the entire df beforehand, which actually resulted in more problems. Printing out y_train['SEPSHOCKPATOS'].str.decode('utf-8') gives me NaNs. – michellemabelle Jul 16 '19 at 16:08

I'm a newbie to Pandas, but maybe str.fullmatch helps - a stricter version of str.contains that matches the whole string, thus,

y_train['SEPSHOCKPATOS'].str.fullmatch('Yes')

although note that this is actually checking against a regular expression, so take care if the string you're using contains any special characters.
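To illustrate the caveat about special characters, here is a small sketch using re.escape so the pattern is treated as literal text. The sample values are invented for the example and are not from the question's data:

```python
import re
import pandas as pd

# Hypothetical values; the last one contains regex metacharacters (parentheses)
s = pd.Series(["Yes", "Yes?", "No", "Yes (confirmed)"])
literal = "Yes (confirmed)"

# Unescaped: the parentheses form a regex group, so the literal text does not match
unescaped = s.str.fullmatch(literal)

# Escaped: re.escape turns the metacharacters into literal characters
escaped = s.str.fullmatch(re.escape(literal))

print(unescaped.tolist())
print(escaped.tolist())
```

With the pattern escaped, str.fullmatch gives the same result as a plain == comparison on this sample.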

Andrew Richards