How to replace NAN values based on the values in another column in pandas

Question

I am using the breast-cancer-wisconsin dataset that looks as follows:

The Bare Nuclei column has 16 missing entries denoted by "?" which I replace with NAN as follows:

df.replace('?', np.nan, regex=False, inplace=True)

resulting in this (a few of the 16 missing entries):

I want to replace the NANs with the most frequently occurring value with respect to each class. To elaborate, the most frequently occurring value in column 'Bare Nuclei' which has class=2 (benign cancer) should be used to replace all the rows that have 'Bare Nuclei' == NAN and Class == 2. Similarly for class = 4 (malignant).

I tried the following:

df[df['Class']== 2]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==2]['Bare Nuclei'].mode(), inplace=True)

df[df['Class']== 4]['Bare Nuclei'].fillna(df_vals[df_vals['Class']==4]['Bare Nuclei'].mode(), inplace=True)

It did not result in any error but when I tried this:

df.isnull().any()

Bare Nuclei shows True which means the NAN values are still there.

(column "Bare Nuclei" is of type object)

I don't understand what I am doing wrong. Please help! Thank you.

Anurag Dabas · Accepted Answer · 2021-08-17T05:13:30.243

You can try via groupby()+agg()+fillna():

s = df_vals.groupby('Class')['Bare Nuclei'].agg(lambda x: x.mode().iat[0])
df['Bare Nuclei'] = df['Bare Nuclei'].fillna(df['Class'].map(s))
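
The groupby-and-map idea can be tried on a small stand-in frame first. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and it assumes the column names `Class` and `Bare Nuclei` from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "Class": [2, 2, 2, 2, 4, 4, 4, 4],
    "Bare Nuclei": ["1", "1", "3", np.nan, "10", "10", "5", np.nan],
})

# Most frequent non-missing value per class (mode() skips NaN by default).
class_mode = toy.groupby("Class")["Bare Nuclei"].agg(lambda s: s.mode().iat[0])

# Map each row's class to its mode; fillna uses it only where a value is missing.
toy["Bare Nuclei"] = toy["Bare Nuclei"].fillna(toy["Class"].map(class_mode))

print(toy)
```

Here the missing entry in class 2 should receive "1" and the one in class 4 should receive "10", since those are the most frequent values within each group.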

OR

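Following your approach, use loc and assign the filled values back. Calling fillna(..., inplace=True) on a selection only changes a temporary copy, and mode() returns a Series, so take its first element with .iat[0]:

df.loc[df['Class'] == 2, 'Bare Nuclei'] = df.loc[df['Class'] == 2, 'Bare Nuclei'].fillna(df_vals.loc[df_vals['Class'] == 2, 'Bare Nuclei'].mode().iat[0])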

@Kiera.K also pls see [understanding-inplace-true](https://stackoverflow.com/questions/43893457/understanding-inplace-true) – Anurag Dabas Aug 17 '21 at 05:09
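
To see why the attempt in the question leaves the NaNs in place, here is a minimal sketch on a made-up frame (the `toy` name and its values are assumptions for illustration). It compares the chained `fillna(..., inplace=True)` call with a single `.loc` assignment.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per class.
toy = pd.DataFrame({
    "Class": [2, 2, 4, 4],
    "Bare Nuclei": ["1", np.nan, "10", np.nan],
})

# Boolean indexing returns a new object, so the in-place fill modifies
# that temporary copy and `toy` keeps its NaNs.
# (pandas may emit a warning about this pattern.)
toy[toy["Class"] == 2]["Bare Nuclei"].fillna("1", inplace=True)
missing_after_chained = int(toy["Bare Nuclei"].isna().sum())

# Assigning through a single .loc call writes into `toy` itself.
mask = (toy["Class"] == 2) & toy["Bare Nuclei"].isna()
toy.loc[mask, "Bare Nuclei"] = "1"
missing_after_loc = int(toy["Bare Nuclei"].isna().sum())

print(missing_after_chained, missing_after_loc)
```

The count of missing values should be unchanged after the chained call and drop by one after the `.loc` assignment.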

user2583808 · Answer 2 · 2021-12-10T17:04:17.523

As a late answer, if you want to replace every NaN in the "Bare Nuclei" column with the value in the "Class" column of the same row, select the rows and the column in a single loc call:

selection_condition = pd.isna(df["Bare Nuclei"])
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]

If you want the replacement to be class-specific:

selection_condition = pd.isna(df["Bare Nuclei"]) & (df["Class"] == 2)
df.loc[selection_condition, "Bare Nuclei"] = df.loc[selection_condition, "Class"]
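
A small sketch of this class-specific replacement on made-up data may help. The `toy` frame is hypothetical, and "Bare Nuclei" is stored as floats here (an assumption, unlike the object column in the question) so that the integer class labels can be written into it without a dtype conflict.

```python
import numpy as np
import pandas as pd

# Made-up data; "Bare Nuclei" is numeric here for simplicity.
toy = pd.DataFrame({
    "Class": [2, 2, 4, 4],
    "Bare Nuclei": [1.0, np.nan, 10.0, np.nan],
})

# Rows that are missing a value AND belong to class 2.
missing_in_class_2 = toy["Bare Nuclei"].isna() & (toy["Class"] == 2)

# One .loc call selects rows and column together, so the write reaches `toy`.
toy.loc[missing_in_class_2, "Bare Nuclei"] = toy.loc[missing_in_class_2, "Class"]

print(toy)
```

Only the missing entry in class 2 is filled; the missing entry in class 4 stays NaN because it is excluded by the condition.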

William Giddens · Answer 3 · 2022-12-03T19:19:36.807

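import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# `file` is the DataFrame already loaded from the dataset
file.info()
# Mark the '?' placeholders as missing, then drop those rows
file.loc[file['Bare Nuclei'] == '?', 'Bare Nuclei'] = np.nan

file.dropna(inplace=True)
file.drop(['Sample code number'], axis=1, inplace=True)
file['Bare Nuclei'] = file['Bare Nuclei'].astype(int)

# Features and labels ('Bare Nuclei' is left out of the features, as in the original)
X = file.drop(['Class', 'Bare Nuclei'], axis=1)
Y = file['Class'].values
num_split = 10  # number of random splits (assumed; not defined in the original)
Sk_overall = 0.0
for i in range(num_split):
    # Vary the seed so each iteration uses a different split
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.8, random_state=i)
    classifier = LogisticRegression(max_iter=200, solver='newton-cg')
    classifier.fit(x_train, y_train)
    Sk_overall = Sk_overall + classifier.score(x_test, y_test)
# Mean accuracy over all splits
Sk_Accuracy = Sk_overall / num_split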
