Determining values for each row of a DataFrame

Question

Here is my DataFrame:

Tipo    Número  renal   dialisis
CC  260037  NULL    NULL
CC  260037  NULL    AAB
CC  165182  NULL    NULL
CC  165182  NULL    CCDE
CC  260039  NULL    NULL
CC  49740   XYZ NULL
CC  260041  NULL    NULL
CC  259653  NULL    NULL

I want to determine, for each row in the DataFrame, whether the values in renal and dialisis are NULL or not. Rows in which at least one of the two is not NULL should get a 1 in the survived list; rows in which both are NULL should get a 0. My code is:

survival = pd.read_table('Sophia_Personalizado bien.txt',encoding='utf-16')
survived = []
numero_paciente = []
lista_pacienytes= survival['Número'].values.tolist()
lista_pacienytes= sorted(set(lista_pacienytes))


for e in lista_pacienytes:
    survival_i = survival.loc[survival['Número']==e]
    renal = set(survival_i['renal'].values.tolist())
    dialisis = set(survival_i["dialisis"].values.tolist())

    print('dialisis',dialisis)
    print('renal',renal)

    if renal == 'nan' or dialisis == 'nan':
        survived.append(0)
        numero_paciente.append(e)
    else:
        survived.append(1)
        numero_paciente.append(e)

e = pd.DataFrame({'numero': numero_paciente,
                  'survival': survived})

Surprisingly, every row ends up as 1, which, as the DataFrame shows, is not correct. Also, the output of

print('dialisis',dialisis)
print('renal',renal)

is:

dialisis {nan, nan}
renal {nan}

whereas I expected a single nan in each case, since I use set(). What am I missing? Thanks

fuglede · Answer 1 · 2018-10-08T21:04:35.513

For the double NaNs, see this question; essentially it can happen because np.nan != np.nan, but it is not consistent:

In [75]: set(np.array([np.nan, np.nan]))
Out[75]: {nan, nan}

In [76]: set([np.nan, np.nan])
Out[76]: {nan}
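
A minimal sketch of why the two calls differ, assuming CPython's usual set behavior (object identity is checked before equality) and that iterating over a NumPy array yields a fresh scalar object per element. The sample Series at the end is made up for illustration.

```python
import numpy as np
import pandas as pd

# A list holding the same np.nan object twice: the set sees one object.
from_list = set([np.nan, np.nan])

# Iterating over an array creates a new np.float64 per element, so the set
# sees two distinct objects that also compare unequal (nan != nan).
from_array = set(np.array([np.nan, np.nan]))

print(len(from_list), len(from_array))

# Dropping missing values before building the set sidesteps the issue.
values = pd.Series([np.nan, "AAB", np.nan], dtype=object)  # illustrative data
non_missing = set(values.dropna())
print(non_missing)
```

Building the set from the non-missing values only means an empty set signals "everything was NaN", which is easier to test than the presence of nan itself.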

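Regarding the issue of having too many surviving rows, this boils down to the fact that you compare renal and dialisis to the string 'nan' rather than testing for the float np.nan. Since renal and dialisis are sets, comparing them to a string is always False, so the else branch runs every time. Note that an equality comparison with np.nan does not work either, precisely because np.nan != np.nan; use np.isnan or pd.isnull on the values instead.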
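
The checks can be compared side by side in a small sketch. Note that an equality test against np.nan itself evaluates to False, so the sketch relies on np.isnan and pd.isnull for detection; the `renal` Series here is illustrative data, not the asker's file.

```python
import numpy as np
import pandas as pd

value = np.nan

string_check = (value == 'nan')        # a float compared with a string
equality_check = (value == np.nan)     # nan never compares equal, even to itself
isnan_check = bool(np.isnan(value))    # works for numeric input only
isnull_check = bool(pd.isnull(value))  # also accepts strings and None

# For a whole column, ask whether any entry is present at all.
renal = pd.Series([np.nan, 'XYZ'], dtype=object)  # illustrative data
has_value = bool(renal.notnull().any())

print(string_check, equality_check, isnan_check, isnull_check, has_value)
```

Because the renal and dialisis columns hold strings, pd.isnull (or the Series methods isnull and notnull) is probably the safer choice here; np.isnan raises a TypeError when given a string.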

Note, however, that idiomatic pandas (and NumPy for that matter) typically has you perform the operations one column at a time when possible, rather than picking out the values and iterating over those, so in your case, what you are looking for can also be obtained through the following:

In [66]: df['survived'] = ~(df.renal.isnull() & df.dialisis.isnull())

In [67]: df
Out[67]:
  Tipo  Número renal dialisis  survived
0   CC  260037   NaN      NaN     False
1   CC  260037   NaN      AAB      True
2   CC  165182   NaN      NaN     False
3   CC  165182   NaN     CCDE      True
4   CC  260039   NaN      NaN     False
5   CC   49740   XYZ      NaN      True
6   CC  260041   NaN      NaN     False
7   CC  259653   NaN      NaN     False

Here, an alternative way of getting the same result would be to apply isnull to both columns at once, through ~df[['renal', 'dialisis']].isnull().all(axis=1).
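
As a rough check that the two formulations agree, the sketch below rebuilds the table from the question by hand (assuming the NULL entries are read in as NaN) and compares the results.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the question's table; NULL is assumed to load as NaN.
df = pd.DataFrame({
    'Tipo': ['CC'] * 8,
    'Número': [260037, 260037, 165182, 165182, 260039, 49740, 260041, 259653],
    'renal': [np.nan, np.nan, np.nan, np.nan, np.nan, 'XYZ', np.nan, np.nan],
    'dialisis': [np.nan, 'AAB', np.nan, 'CCDE', np.nan, np.nan, np.nan, np.nan],
})

# One column at a time: a row survives unless both columns are missing.
column_wise = ~(df['renal'].isnull() & df['dialisis'].isnull())

# Both columns at once, reducing across each row.
both_at_once = ~df[['renal', 'dialisis']].isnull().all(axis=1)

# Equivalent phrasing: at least one of the two values is present.
any_present = df[['renal', 'dialisis']].notnull().any(axis=1)

print(column_wise.tolist())
print((column_wise == both_at_once).all(), (column_wise == any_present).all())
```

The row-wise form scales more naturally if further columns need to be included in the check later.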

If you really prefer having 0s and 1s instead:

In [71]: df['survived'] = df['survived'].astype(int)

In [72]: df
Out[72]:
  Tipo  Número renal dialisis  survived
0   CC  260037   NaN      NaN         0
1   CC  260037   NaN      AAB         1
2   CC  165182   NaN      NaN         0
3   CC  165182   NaN     CCDE         1
4   CC  260039   NaN      NaN         0
5   CC   49740   XYZ      NaN         1
6   CC  260041   NaN      NaN         0
7   CC  259653   NaN      NaN         0

Excellent explanation. I wish more Pandas answers are like this! — jpp, Oct 07 '18 at 09:50
In addition to the above, I get the sense that the repeated calls to `set` mean duplicates should be removed. Using this answer together with a simple `df.groupby("Numero").survived.max()` would achieve this. — coffeinjunky, Oct 07 '18 at 09:57
@coffeinjunky: Not sure if this is what you mean, but `set(set(np.array([np.nan, np.nan])))` has two elements. — fuglede, Oct 07 '18 at 10:00
This was just a remark for the op, not so much a comment on your solution. Looking at the original code, it looks like he/she wants one row per patient ID (numero) as the outcome, and not (as written) one row per original row. I may be wrong. Just wanted to point out how he/she could achieve this. — coffeinjunky, Oct 07 '18 at 10:07
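
The per-patient variant suggested in the comments could look roughly like the sketch below. It assumes the ID column is named Número, as in the question's table, and that a patient counts as 1 if any of their rows has a value in renal or dialisis; the DataFrame is rebuilt by hand for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the question's table; NULL is assumed to load as NaN.
df = pd.DataFrame({
    'Tipo': ['CC'] * 8,
    'Número': [260037, 260037, 165182, 165182, 260039, 49740, 260041, 259653],
    'renal': [np.nan, np.nan, np.nan, np.nan, np.nan, 'XYZ', np.nan, np.nan],
    'dialisis': [np.nan, 'AAB', np.nan, 'CCDE', np.nan, np.nan, np.nan, np.nan],
})

# Row-level flag: 1 if at least one of the two columns holds a value.
df['survived'] = df[['renal', 'dialisis']].notnull().any(axis=1).astype(int)

# Collapse to one row per patient: the max is 1 if any row was flagged.
per_patient = df.groupby('Número')['survived'].max().reset_index()

print(per_patient)
```

Taking the maximum per group replaces the explicit loop over patient IDs and the calls to set in the original code.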
