I have a few categorical variables which I binary encoded.

The problem is that there are a lot of NaN values. I know I can just use df.fillna(0) to replace them, but will that be meaningful for machine learning?

Some columns have data and some columns are filled with NaNs, and this varies row by row.

How can I make the data useful? What specific operation is required?

david nadal

1 Answer

Missing values are very common, and there are various methods for filling them. Before filling anything in, remember that the filled value should be reasonably close to the real data. For example, in financial analysis, when a customer's transaction value is missing, you should not put zero; instead you could fill it with the mean or the median, depending on the data distribution.

How to fill missing data depends critically on the data and the business logic.
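To illustrate why the data distribution matters, here is a minimal sketch. The transaction amounts are made up for the example, and the single large value stands in for a skewed distribution; it is not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction amounts; the 1000.0 outlier skews the distribution.
tx = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 1000.0], name="amount")

# The mean is pulled far upward by the outlier.
mean_fill = tx.fillna(tx.mean())

# The median stays close to a typical transaction.
median_fill = tx.fillna(tx.median())

print("filled with mean:  ", mean_fill[3])
print("filled with median:", median_fill[3])
```

With skewed data like this, the median is usually the safer choice; for roughly symmetric data the two give similar results.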

You could fill the values using one of the following methods:

  1. Filling with a constant:

df.fillna(0)

  2. Filling with the mean, median, etc.:

df.fillna(df.mean(numeric_only=True))

  3. Filling with the group-wise mean, median, etc.:

df['a'] = df['a'].fillna(df.groupby('b')['a'].transform('mean'))
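The three approaches can be compared side by side on a small frame. This is only a sketch: the DataFrame and its column names ('a', 'b', 'c') are invented for illustration, with 'b' playing the role of the grouping column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is a grouping column, 'a' and 'c' have missing values.
df = pd.DataFrame({
    "b": ["x", "x", "y", "y"],
    "a": [1.0, np.nan, 10.0, np.nan],
    "c": [np.nan, 2.0, 4.0, 6.0],
})

# 1. Constant fill: every NaN in the numeric columns becomes 0.
const_filled = df.fillna({"a": 0, "c": 0})

# 2. Column-mean fill: each NaN becomes the mean of its own column.
mean_filled = df.fillna(df.mean(numeric_only=True))

# 3. Group-wise fill: each NaN in 'a' becomes the mean of 'a' within its 'b' group.
group_means = df.groupby("b")["a"].transform("mean")
group_filled = df.assign(a=df["a"].fillna(group_means))

print(const_filled)
print(mean_filled)
print(group_filled)
```

Note that fillna returns a new object, so the result has to be assigned back (or kept in a new variable, as here) for the fill to take effect.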

There are many other methods; for details, please visit here

jezrael
Mohamed Thasin ah