I am using the breast-cancer-wisconsin dataset that looks as follows:
The Bare Nuclei column has 16 missing entries, denoted by "?", which I replace with NaN as follows:
df.replace('?', np.nan, regex=False, inplace=True)
resulting in this (a few of the 16 missing entries):
I want to replace the NaNs with the most frequently occurring value within each class. That is, the most frequent 'Bare Nuclei' value among rows with Class == 2 (benign) should fill every row where 'Bare Nuclei' is NaN and Class == 2, and likewise for Class == 4 (malignant).
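To make the goal concrete, here is a minimal sketch of the per-class "most frequent value" on a toy frame. The frame `toy` and its values are made up for illustration and are not the real dataset; only the column names match.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical values, same column names).
toy = pd.DataFrame({
    "Bare Nuclei": ["1", "1", np.nan, "2", "10", "10", np.nan, "3"],
    "Class":       [2,   2,   2,      2,   4,    4,    4,      4],
})

# Series.mode() ignores NaN by default and returns a Series (there can be ties),
# so take its first entry to get a single value per class.
class_modes = toy.groupby("Class")["Bare Nuclei"].agg(lambda s: s.mode().iloc[0])
print(class_modes)
```

On this toy data, the value for Class 2 is "1" and for Class 4 is "10"; these are the values that should end up in the NaN rows of each class.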
I tried the following:
df[df['Class']== 2]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==2]['Bare Nuclei'].mode(), inplace=True)
df[df['Class']== 4]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==4]['Bare Nuclei'].mode(), inplace=True)
This did not raise any error, but when I ran:
df.isnull().any()
Bare Nuclei shows True, which means the NaN values are still there.
(The "Bare Nuclei" column has dtype object.)
I don't understand what I am doing wrong. Please help! Thank you.
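A likely explanation, offered as a sketch rather than a definitive diagnosis: df[df['Class'] == 2] builds a new DataFrame through boolean indexing, so fillna(..., inplace=True) fills that temporary copy and df itself never changes. Separately, mode() returns a Series, and fillna with a Series aligns on index labels, so it would not fill the intended rows even on the original frame. One way around both issues is to compute the per-class mode with groupby and transform, then assign the result back. The toy frame and the helper name `fill_with_class_mode` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical values, same column names).
df = pd.DataFrame({
    "Bare Nuclei": ["1", "1", "?", "2", "10", "10", "?", "3"],
    "Class":       [2,   2,   2,   2,   4,    4,    4,   4],
})
df = df.replace("?", np.nan)

def fill_with_class_mode(frame, col="Bare Nuclei", by="Class"):
    """Return a copy of `frame` with NaNs in `col` filled by the mode of each `by` group."""
    out = frame.copy()
    # transform broadcasts each group's mode back onto that group's rows,
    # so `modes` shares the frame's index. mode() skips NaN; .iloc[0] picks
    # one value if there are ties. Assumes every group has at least one
    # non-missing value, otherwise .iloc[0] raises IndexError.
    modes = out.groupby(by)[col].transform(lambda s: s.mode().iloc[0])
    # Assign back explicitly instead of using inplace=True on a chained selection.
    out[col] = out[col].fillna(modes)
    return out

filled = fill_with_class_mode(df)
print(filled)
print(filled.isnull().any())
```

Assigning the result back to the column avoids modifying a temporary copy, and because the transformed Series shares the frame's index, each NaN is filled with the mode of its own class.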