I have an `Age category` column in my pandas dataframe, `df`. In the `Age category` column, 32% of the values are missing and need to be imputed. I'm thinking of using the distribution of the available data (the other 68%) to impute the missing values.
The screenshot below is the distribution of the available data (the 68%) for the age category:
As you can see from the table, `36 - 45` accounts for 29.5%, `46 - 55` accounts for 24.9%, etc.
Hence, I expect that when I impute the 32% missing values, `36 - 45` will make up approximately 29.5% of the imputed values as well, `46 - 55` approximately 24.9%, and so on.
Once I have imputed all the `NaN` values in the `Age category` column, the overall distribution should not differ much from the one in the screenshot. How can I achieve that?
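To make the goal concrete, here is a minimal sketch of the kind of imputation I have in mind: compute the observed proportions with `value_counts(normalize=True)`, then draw one category per missing row using those proportions as weights. The toy `df`, the category labels, the probabilities used to build it, and the random seeds are all made up for illustration; only the column name comes from my real data. I'm not sure this is the best approach.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real df (labels and probabilities are hypothetical).
rng = np.random.default_rng(0)
labels = ["26 - 35", "36 - 45", "46 - 55"]
df = pd.DataFrame({"Age category": rng.choice(labels, size=1000, p=[0.3, 0.4, 0.3])})

# Blank out 32% of the rows to mimic the missing values.
df.loc[df.sample(frac=0.32, random_state=0).index, "Age category"] = np.nan

col = "Age category"

# Proportions among the non-missing rows (NaN is excluded by default).
observed = df[col].value_counts(normalize=True)

# Draw one category per missing row, weighted by the observed proportions.
missing = df[col].isna()
df.loc[missing, col] = rng.choice(
    observed.index.to_numpy(),
    size=int(missing.sum()),
    p=observed.to_numpy(),
)

# Proportions after imputation, for comparison with `observed`.
imputed = df[col].value_counts(normalize=True)
print(pd.DataFrame({"observed": observed, "after imputation": imputed}))
```

Because the draws are random, the proportions after imputation should only approximately match the observed ones, and the match gets closer as the number of missing rows grows.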
Any help or advice will be greatly appreciated!