I would like to fill the 18543 missing values in the target column (dependent variable), Complaint-Status, of a dataset that has class imbalance. The target column has five classes, so this is a multi-class classification problem.

What is the best way to fill these values without increasing the class imbalance?

[Image: Dataset]

Replacing the missing values with the column mode, i.e. 'Closed with explanation', would only add to the class imbalance.

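import numpy as np
# Count how often each label (including the empty string) occurs in the target column
uniq, kounts = np.unique(df_ohe['Complaint-Status'], return_counts=True)
print(np.asarray((uniq, kounts)).T)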

[['' 18543]
 ['Closed' 809]
 ['Closed with explanation' 34300]
 ['Closed with monetary relief' 2818]
 ['Closed with non-monetary relief' 5018]
 ['Untimely response' 321]]
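As a quick check of that concern, the counts above can be plugged into a few lines of arithmetic. This is only a sketch: the dictionary is copied by hand from the printed output, and the variable names are made up for illustration.

```python
# Counts copied from the printed output above; the '' row is the missing labels.
counts = {
    "Closed": 809,
    "Closed with explanation": 34300,
    "Closed with monetary relief": 2818,
    "Closed with non-monetary relief": 5018,
    "Untimely response": 321,
}
n_missing = 18543

# The mode is the most frequent label among the labelled rows.
mode_label = max(counts, key=counts.get)
labelled_total = sum(counts.values())

# Share of the mode class before and after filling every missing row with it.
share_before = counts[mode_label] / labelled_total
share_after = (counts[mode_label] + n_missing) / (labelled_total + n_missing)

print(f"{mode_label}: {share_before:.2%} -> {share_after:.2%}")
# Closed with explanation: 79.28% -> 85.49%
```

So filling with the mode would push the majority class from roughly 79% to roughly 85% of all rows.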

The target class percentage

100*c_count.values/c_count.values.sum()
# array([55.49353654, 30.00048537,  8.11855879,  4.55920659,  1.30887088,
#         0.51934184])

Expected Output :

[['class_label', 18543]
 ['Closed' 809]
 ['Closed with explanation' 34300]
 ['Closed with monetary relief' 2818]
 ['Closed with non-monetary relief' 5018]
 ['Untimely response' 321]]
joel
  • It's unclear what you're asking exactly. What are your inputs, and what do you expect your code to do? – Jordan Singer Jan 25 '19 at 21:15
  • @Jordan I have updated the question – joel Jan 25 '19 at 21:26
  • I'm not sure what `Need_to_impute_values_here` means. What needs to go here? What does it mean to "impute" in this context? – Jordan Singer Jan 25 '19 at 21:27
  • Sorry. It would be the target class label like 'Closed'. Imputation == fill missing target label here – joel Jan 25 '19 at 21:32
  • The "best" way is going to be entirely dependent on your overall data. You can use the mean, median, mode, or more advanced methods like those available from [fancyimpute](https://stackoverflow.com/questions/45239256/data-imputation-with-fancyimpute-and-pandas) – G. Anderson Jan 25 '19 at 21:34
  • I'm a little confused. Is this train/test data with the dependent variable missing, and you're trying to fill that in "randomly"? If you're trying to build a model on incomplete data, you will get a bad model (or worse, a model that *looks* good and then goes rogue on validation data - there go a few hours). Please clarify what you are trying to accomplish – MattR Jan 25 '19 at 21:45
  • @MattR Yes it is the dependent variable that is missing. Earlier I was thinking to fill with mode. But I believe it is wrong. – joel Jan 26 '19 at 05:59
  • @joel can you comment on my answer? Does it help? If not, can you be more specific about what you are looking for? There aren't many choices for data imputation, and I agree with you on not using the mode to fill missing values. – Aiden Zhao Jan 26 '19 at 08:19
  • It's a bit scary to fill target column with some median :) You should try fancyimpute or alice (R) or missforest (R) - something like this. They will fill missing values in a smarter way: considering other columns. Maybe it will help to keep class balance. – avchauzov Jan 28 '19 at 07:42

1 Answer

Just build a model on the other features to predict the missing labels, which should maintain your distribution. Since the missing data is categorical, it doesn't make sense to use the mean or median. Even if it were numeric, I would still advise against it: filling with the mean or median reduces the variance and therefore changes the distribution.
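A minimal sketch of that idea, assuming pandas and scikit-learn are available. The helper `fill_missing_target`, the feature names `f1`/`f2`, and the toy data are all hypothetical; `class_weight="balanced"` is one common way to keep the classifier from favouring the majority class, not something the original data requires.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def fill_missing_target(df, target, feature_cols, random_state=0):
    """Hypothetical helper: predict missing target labels from the other columns."""
    out = df.copy()
    # Treat both NaN and the empty string as "missing".
    missing = out[target].isna() | (out[target] == "")
    if missing.any():
        clf = RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=random_state
        )
        # Train only on rows whose label is known...
        clf.fit(out.loc[~missing, feature_cols], out.loc[~missing, target])
        # ...then predict a label for every row where it is missing.
        out.loc[missing, target] = clf.predict(out.loc[missing, feature_cols])
    return out


# Toy data standing in for the real frame (assumption: numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
labels = np.where(X[:, 0] > 0.8, "Closed", "Closed with explanation").astype(object)
toy = pd.DataFrame(X, columns=["f1", "f2"])
toy["Complaint-Status"] = labels
toy.loc[toy.index[::5], "Complaint-Status"] = ""  # blank out every fifth label

filled = fill_missing_target(toy, "Complaint-Status", ["f1", "f2"])
print(filled["Complaint-Status"].value_counts())
```

The filled-in labels are model guesses rather than ground truth, so it is probably safer to keep those rows out of any validation set.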

Also, if you build a tree-based model (decision tree, random forest, GBDT), it will be able to handle missing data. See the lightgbm and xgboost packages.

Aiden Zhao