I have a column 'Gender' inside a synthetic dataframe with value_counts that look like this:

    df['Gender'].value_counts()

    male       42758
    female     27170
    other      27060
    unknown     6849
    0            724
    Name: Gender, dtype: int64

I am preprocessing this dataset for linear regression. Does it make sense to club '0' and 'unknown' together and replace their occurrences with 'male', since 'male' is the most frequently occurring value?

  • Essentially replacing with the mode. Yes, it makes sense – yatu Dec 12 '19 at 14:38
  • If you want to be extra cautious: do you know why observations are unknown? Is it random, or is it caused by something? Is it more likely that females end up as unknown? Sometimes it might be better to leave them as a separate category, but *most* times replacing with the mode might be enough. – Juan C Dec 12 '19 at 14:40
  • It really depends on your population. If you replace them with `male`, then males make up roughly half of the rows (about 48%, up from about 41%). Do you expect that in a general population? – Quang Hoang Dec 12 '19 at 14:43
  • https://datascience.stackexchange.com/questions/39058/dealing-with-nan-missing-values-for-logistic-regression-best-practices – PV8 Dec 12 '19 at 15:15
  • https://stackoverflow.com/questions/33113947/using-scikit-learn-sklearn-how-to-handle-missing-data-for-linear-regression – PV8 Dec 12 '19 at 15:15
  • The model will underfit if you club to the male category – Saniya Parveez May 27 '20 at 08:58
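The "separate category" option raised in the comments can be sketched as follows. This is a minimal illustration on made-up toy data, not the asker's dataframe; the column name `Gender_clean` is an assumption introduced here. It folds `'0'` into `'unknown'` instead of `'male'`, then one-hot encodes the result so a linear model can use it.

```python
import pandas as pd

# Toy data standing in for the real column (values are made up for illustration).
df = pd.DataFrame(
    {"Gender": ["male", "female", "other", "unknown", "0", "male", "female", "male"]}
)

# Treat '0' as another spelling of 'unknown' rather than folding both into 'male'.
# 'Gender_clean' is a hypothetical column name used only in this sketch.
df["Gender_clean"] = df["Gender"].replace({"0": "unknown"})

# One-hot encode the cleaned column. drop_first=True drops one level so the
# dummy columns are not perfectly collinear with the model's intercept.
dummies = pd.get_dummies(df["Gender_clean"], prefix="Gender", drop_first=True)

print(df["Gender_clean"].value_counts())
print(dummies.columns.tolist())
```

Keeping `unknown` as its own level lets the regression estimate a separate coefficient for it, which matters if the missingness is not random.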

1 Answer

  • You can drop those rows, as their count is very low compared to the other levels of this column.
  • Another solution is to treat those values as missing and fill them with `fillna`, using the mode or the nearest value from other rows (the median applies only to numeric columns).
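Both options could look roughly like this in pandas. It is a sketch on made-up toy data, assuming that `'0'` and `'unknown'` are the only placeholder values; the variable names are illustrative.

```python
import pandas as pd

# Toy data standing in for the real column (values are made up for illustration).
df = pd.DataFrame(
    {"Gender": ["male", "female", "other", "unknown", "0", "male", "female", "male"]}
)

# Mark the placeholder values as missing (NaN) so pandas can handle them uniformly.
gender = df["Gender"].mask(df["Gender"].isin(["0", "unknown"]))

# Option 1: drop the rows whose gender is missing.
dropped = df[gender.notna()]

# Option 2: fill the missing entries with the mode of the known values.
mode_value = gender.mode().iloc[0]
filled = gender.fillna(mode_value)

print(len(df), len(dropped))
print(filled.value_counts())
```

Dropping loses the other columns of those rows as well, while mode filling keeps every row but inflates the most frequent category.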