I am hoping someone can help me with this problem. I am working on a machine learning problem and have classified the df1['Age'] column into df1['Age_group']. Unfortunately there is missing data, so any df1['Age'] that is NaN is classified as 3.
Currently the classification 3 only means "missing data", and I want to replace it with something useful. I have used scikit-learn logistic regression to predict the missing age groups, and the predictions are now stored in a NumPy array which I have called missing_age_grps.
Obviously the data set I am working with is much bigger, but the data below should be enough to illustrate the problem.
In the example below, missing_age_grps has only 2 elements, because there are only 2 rows where df1['Age_group'] == 3.
import pandas as pd
import numpy as np
d = {'ID': [0, 1, 2, 3, 4], 'Sex': ["Male","Female","Male","Male", "Female"], 'Age':[np.nan, 23, np.nan, 6, 15] , 'Age_group':[3,2,3,0,1]}
df1 = pd.DataFrame(d)
print(df1)
ID  Sex     Age  Age_group
0   Male    NaN  3
1   Female  23   2
2   Male    NaN  3
3   Male    6    0
4   Female  15   1
...
print(missing_age_grps)
[0, 1]
I am having trouble overwriting only the values in df1['Age_group'] that are equal to 3.
The ideal solution will update only the 3s with the values from the NumPy array, in order. This is the expected output:
print(df1)
ID  Sex     Age  Age_group
0   Male    NaN  0
1   Female  23   2
2   Male    NaN  1
3   Male    6    0
4   Female  15   1
...
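For illustration, here is a minimal sketch of one way this kind of update could be done, using boolean-mask assignment with `.loc`. It assumes that missing_age_grps holds exactly one prediction per row where df1['Age_group'] == 3, and that the predictions are in the same top-to-bottom order as those rows. The `mask` variable is a name introduced here for the example only.

```python
import numpy as np
import pandas as pd

d = {
    'ID': [0, 1, 2, 3, 4],
    'Sex': ["Male", "Female", "Male", "Male", "Female"],
    'Age': [np.nan, 23, np.nan, 6, 15],
    'Age_group': [3, 2, 3, 0, 1],
}
df1 = pd.DataFrame(d)

# Stand-in for the logistic regression predictions (values taken from the example).
missing_age_grps = np.array([0, 1])

# Boolean mask (hypothetical name) selecting only the rows flagged as "missing data".
mask = df1['Age_group'] == 3

# Assumption: one prediction per masked row, in the same row order.
assert mask.sum() == len(missing_age_grps)

# Overwrite only the masked rows; all other rows are left untouched.
df1.loc[mask, 'Age_group'] = missing_age_grps
print(df1)
```

The length check is there because a mismatch between the number of 3s and the number of predictions would otherwise raise an error at the assignment, or, worse, hide a misalignment between predictions and rows.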