How to handle 0 value in categoric variable column?

Question

I have a column 'Gender' inside a synthetic dataframe with value_counts that look like this:

    df['Gender'].value_counts()

    male       42758
    female     27170
    other      27060
    unknown     6849
    0            724
    Name: Gender, dtype: int64

I am preprocessing this dataset for linear regression. Does it make sense to club '0' and 'unknown' together and replace their occurrences with 'male', since 'male' is the most frequently occurring value?

If you want to be extra cautious, do you know when observations are unknown? Is it random or is it caused by something? Is it more likely that females end up as unknowns? Sometimes it might be better to leave them as a separate category, but *most* of the time replacing with the mode might be enough. — Juan C, Dec 12 '19 at 14:40
It really depends on your population. If you replace them with `male`, then males make up roughly 48% of the rows (up from about 41%), well ahead of every other category. Do you expect that in a general population? — Quang Hoang, Dec 12 '19 at 14:43
https://datascience.stackexchange.com/questions/39058/dealing-with-nan-missing-values-for-logistic-regression-best-practices — PV8, Dec 12 '19 at 15:15
https://stackoverflow.com/questions/33113947/using-scikit-learn-sklearn-how-to-handle-missing-data-for-linear-regression — PV8, Dec 12 '19 at 15:15
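
The two options raised in the comments above (imputing with the mode versus keeping a separate category) can be compared with a short pandas sketch. The counts are taken from the question; the rebuilt frame `df`, the assumption that `0` is stored as the string `'0'`, and the use of `pd.get_dummies` for encoding are illustrative choices, not something the thread specifies.

```python
import pandas as pd

# Rebuild a Gender column with the counts shown in the question.
# Assumption: the 0 value is stored as the string '0', not the integer 0.
counts = {"male": 42758, "female": 27170, "other": 27060, "unknown": 6849, "0": 724}
df = pd.DataFrame({"Gender": [label for label, n in counts.items() for _ in range(n)]})

# Option A: impute '0' and 'unknown' with the mode and check the resulting share.
imputed = df["Gender"].replace({"0": "male", "unknown": "male"})
male_share = (imputed == "male").mean()

# Option B: merge '0' into 'unknown', keep it as its own category,
# then one-hot encode (drop_first avoids a redundant column in a linear model).
merged = df["Gender"].replace({"0": "unknown"})
dummies = pd.get_dummies(merged, prefix="Gender", drop_first=True)

print(f"male share after mode imputation: {male_share:.3f}")
print(dummies.columns.tolist())
```

Option B lets the regression estimate a separate coefficient for the unknown group instead of assuming those rows behave like `male`, which is the safer choice when it is unclear whether the values are missing at random.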

score 0 · Answer 1 · answered Dec 12 '19 at 16:26

You can drop those rows, as their count is very low compared to the other levels of this column.
Another solution is to treat those values as missing and fill them (`fillna`) using the median, mode or nearest value from other rows.
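
Both suggestions can be sketched in pandas as follows. This is a minimal illustration on a small made-up frame: the `Score` column and the sample values are hypothetical, and the placeholders are assumed to be the strings `'0'` and `'unknown'`.

```python
import pandas as pd

# Small hypothetical frame; 'Score' stands in for the regression target.
df = pd.DataFrame({
    "Gender": ["male", "female", "0", "other", "unknown", "male"],
    "Score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Mark the placeholder values as missing (mask sets matching entries to NaN).
gender = df["Gender"].mask(df["Gender"].isin(["0", "unknown"]))

# Option 1: drop the rows whose Gender is missing.
dropped = df[gender.notna()]

# Option 2: fill the missing entries with the mode (most frequent value).
filled = df.assign(Gender=gender.fillna(gender.mode().iloc[0]))

print(dropped["Gender"].tolist())
print(filled["Gender"].tolist())
```

The median is only defined for numeric or ordered data, so for an unordered column such as `Gender` the mode is the option that applies.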

Sayed A. Omar
