I'm making a mask for my df (an imported CSV file) based on string comparisons, but it seems that .contains works while == doesn't.
This mask, using .contains:
mask = (y_train['SEPSISPATOS'].str.contains('Yes')) | (y_train['SEPSHOCKPATOS'].str.contains('Yes')) | (y_train['OTHSYSEP'].str.contains('Sepsis')) | (y_train['OTHSESHOCK'].str.contains('Septic Shock'))
returns this output (note last line):
SEPSISPATOS SEPSHOCKPATOS OTHSYSEP OTHSESHOCK SEPSISPATOS
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'Septic Shock' 1
while this other mask, using direct comparison:
mask = (y_train['SEPSISPATOS']=='Yes') | (y_train['SEPSHOCKPATOS']=='Yes') | (y_train['OTHSYSEP']=='Sepsis') | (y_train['OTHSESHOCK']=='Septic Shock')
returns:
SEPSISPATOS SEPSHOCKPATOS OTHSYSEP OTHSESHOCK SEPSISPATOS
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'No Complication' 0
b'No' b'No' b'No Complication' b'Septic Shock' 0
Wondering whether I have bytes rather than Python 3 Unicode strings, I tried decoding (below). I also tried .str.strip(). Neither worked. I need a fix that will let me use direct == comparisons between strings for any column containing text.
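One way to test the bytes-versus-str guess is to look at the Python types that are actually stored in a column. The sketch below is only illustrative: it uses a small made-up Series (`col`, with hypothetical values) rather than the real CSV data, and the same two checks could be run on any column of `y_train`.

```python
import pandas as pd

# Hypothetical stand-in for one column of y_train; not the real CSV data.
# It mixes str values that merely look like bytes literals with one real bytes value.
col = pd.Series(["b'No'", "b'Septic Shock'", b"No"], name="OTHSESHOCK", dtype=object)

# Count the Python types actually stored in the column.
type_counts = col.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# repr() exposes prefixes, quotes, and whitespace that a plain print can hide.
value_reprs = col.map(repr).tolist()
print(value_reprs)
```

If every value reports `str` and the repr shows the `b'...'` wrapper inside the quotes, the prefix is part of the text itself, which would explain why a substring match succeeds while an exact == comparison fails.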
Edit re: utf-8 decoding
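import pandas as pd

NSQIPdf_train = pd.read_csv("acs_nsqip_puf13_2.csv")
# Select the object-dtype (text) columns; builtin object replaces np.object, removed in NumPy 1.24
str_df = NSQIPdf_train.select_dtypes([object])
# Try to decode every value as UTF-8 bytes
str_df = str_df.stack().str.decode('utf-8').unstack()
for col in str_df:
    NSQIPdf_train[col] = str_df[col]
y_train = NSQIPdf_train.loc[:, ('SEPSISPATOS', 'SEPSHOCKPATOS', 'OTHSYSEP', 'OTHSESHOCK')]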
This further compounded my problem, as the output became:
SEPSISPATOS SEPSHOCKPATOS OTHSYSEP OTHSESHOCK SEPSISPATOS
NaN NaN NaN NaN 0
NaN NaN NaN NaN 0
NaN NaN NaN NaN 0
NaN NaN NaN NaN 0
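For illustration, here is a minimal sketch of one possible normalization, under the assumption (not confirmed for this file) that the values are ordinary str objects whose text literally contains the `b'...'` wrapper. The helper `to_text` and the `sample` frame are hypothetical names and data made up for this example; real bytes values are decoded, text that parses as a bytes literal is unwrapped, and everything else is passed through unchanged.

```python
import ast

import pandas as pd


def to_text(value):
    """Return plain text for bytes, or for a str that looks like a bytes literal."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # not a valid literal after all; leave it untouched
        return parsed.decode("utf-8") if isinstance(parsed, bytes) else value
    return value  # NaN and other types pass through unchanged


# Hypothetical sample mimicking two of the columns; not the real CSV data.
sample = pd.DataFrame(
    {
        "OTHSYSEP": ["b'No Complication'", "b'Sepsis'"],
        "OTHSESHOCK": ["b'No Complication'", "b'Septic Shock'"],
    },
    dtype=object,
)

# Apply the helper element by element, column by column.
cleaned = sample.apply(lambda s: s.map(to_text))

# Direct comparison on the cleaned text.
mask = (cleaned["OTHSYSEP"] == "Sepsis") | (cleaned["OTHSESHOCK"] == "Septic Shock")
print(mask.tolist())
```

On this made-up sample the second row matches and the first does not. Whether the same approach applies to the real file depends on what the type check on the actual columns shows.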