Ludwig preprocessing

Question

I'm running a model with Ludwig.

Features

workclass: almost 70% of the instances are Private, so the Unknown (?) values can be imputed with this value.

native_country: 90% of the instances are United States, which can be used to impute the Unknown (?) values. The same cannot be said of the occupation column, as its values are more evenly distributed.

capital_gain has 72% of instances with zero values for less than 50K and 19% of instances with zero values for >50K.

capital_loss has 73% of instances with zero values for less than 50K and 21% of instances with zero values for >50K.

When I define the model, what is the best way to handle the above cases?

{
  "name": "workclass",
  "type": "category",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "native_country",
  "type": "category",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "capital_gain",
  "type": "numerical",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},
{
  "name": "capital_loss",
  "type": "numerical",
  "preprocessing": {
    "missing_value_strategy": "fill_with_mean"
  }
},

Questions:

1) For category features, how do I define: if you find ?, replace it with X?

2) For numerical features, how do I define: if you find 0, replace it with the mean?

Looking at the Ludwig documentation, I'm not sure what the definition of "a missing value" is, though it appears to be a CSV entry that contains nothing, as in ",,", and not a CSV entry that contains a question mark, as in ",?,", or a zero, as in ",0,". If that definition is correct, you will need to preprocess your dataset to replace ",?," with ",," and ",0," with ",,". - Richard Chambers, Jul 02 '19 at 13:54

w4nderlust · Accepted Answer · 2019-07-03T01:05:14.750

Ludwig currently treats only empty entries in the CSV file, such as two consecutive commas, as missing values for its replacement strategies. In your case I would suggest doing some minimal preprocessing on your dataset, replacing the zeros and ? with missing values, depending on the type of feature. You can easily do it in pandas with something like: df.loc[df.my_column == <value>, 'my_column'] = <new_value>. The alternative is to perform the replacement yourself in your code (for instance replacing 0s with averages), so that Ludwig doesn't have to do it and you have full control of the replacement strategy.
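A minimal pandas sketch of both routes, for illustration only. The tiny inline dataset and the variable names (to_missing, imputed, csv_text) are made up for the example, and treating every zero in capital_gain as missing is the question's assumption, not something the data guarantees. Option 1 turns the placeholders into real missing values, which pandas writes out as empty CSV fields; option 2 imputes the values directly so Ludwig has nothing left to fill.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset; the values are illustrative only.
raw = io.StringIO(
    "workclass,native_country,capital_gain\n"
    "Private,United-States,0\n"
    "?,?,2174\n"
    "Self-emp,United-States,0\n"
    "Private,Mexico,5000\n"
)
df = pd.read_csv(raw)
category_cols = ["workclass", "native_country"]

# Option 1: convert placeholders to real missing values (NaN).
# to_csv writes NaN as an empty field, i.e. two consecutive commas.
to_missing = df.copy()
to_missing[category_cols] = to_missing[category_cols].replace("?", np.nan)
to_missing["capital_gain"] = to_missing["capital_gain"].replace(0, np.nan)
csv_text = to_missing.to_csv(index=False)

# Option 2: impute in pandas and keep full control of the strategy.
imputed = df.copy()
for col in category_cols:
    # Most frequent value among the known entries of this column.
    most_common = imputed.loc[imputed[col] != "?", col].mode().iloc[0]
    imputed.loc[imputed[col] == "?", col] = most_common

# Mean over the non-zero entries only, so the zeros do not drag it down.
nonzero_mean = imputed.loc[imputed["capital_gain"] != 0, "capital_gain"].mean()
imputed["capital_gain"] = imputed["capital_gain"].astype(float)
imputed.loc[imputed["capital_gain"] == 0, "capital_gain"] = nonzero_mean
```

With option 1 you would save csv_text to a file and let the missing_value_strategy in the model definition do the filling; with option 2 the data reaches Ludwig already complete.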
