Missing values in target label

Question

I would like to fill the 18,543 missing values in the target column (dependent variable), Complaint-Status, of a dataset that has class imbalance. The target column has five classes (a multi-class classification problem).

What is the best way to fill these values without increasing the class imbalance?

Dataset

Replacing these missing values with the mode of the column, i.e. 'Closed with explanation', would only add to the class imbalance.

import numpy as np
# Count how often each label occurs; '' marks a missing target value
uniq, kounts = np.unique(df_ohe['Complaint-Status'], return_counts=True)
print(np.asarray((uniq, kounts)).T)

[['' 18543]
 ['Closed' 809]
 ['Closed with explanation' 34300]
 ['Closed with monetary relief' 2818]
 ['Closed with non-monetary relief' 5018]
 ['Untimely response' 321]]

The target class percentage

100*c_count.values/c_count.values.sum()
# array([55.49353654, 30.00048537,  8.11855879,  4.55920659,  1.30887088,
#         0.51934184])

Expected Output :

[['class_label', 18543]
 ['Closed' 809]
 ['Closed with explanation' 34300]
 ['Closed with monetary relief' 2818]
 ['Closed with non-monetary relief' 5018]
 ['Untimely response' 321]]

It's unclear what you're asking exactly. What are your inputs, and what do you expect your code to do? — Jordan Singer, Jan 25 '19 at 21:15
I'm not sure what `Need_to_impute_values_here` means. What needs to go here? What does it mean to "impute" in this context? — Jordan Singer, Jan 25 '19 at 21:27
Sorry. It would be the target class label like 'Closed'. Imputation == fill missing target label here — joel, Jan 25 '19 at 21:32
The "Best" way is going to be entirely dependent on your overall data. You can use the mean, median, mode, or more advanced methods like those available from [fancyimpute](https://stackoverflow.com/questions/45239256/data-imputation-with-fancyimpute-and-pandas) — G. Anderson, Jan 25 '19 at 21:34
I'm a little confused. Is this train/test data with the dependent variable missing, and you're trying to fill that in "randomly"? If you're trying to build a model on incomplete data, you will get a bad model (or worse, a model that *looks* good and then goes rogue on validation data - there go a few hours). Please clarify what you are trying to accomplish — MattR, Jan 25 '19 at 21:45
@MattR Yes it is the dependent variable that is missing. Earlier I was thinking to fill with mode. But I believe it is wrong. — joel, Jan 26 '19 at 05:59
@joel can you comment on my answer? Does it help? If not, can you try to be more specific about what you are looking for? There aren't many choices for data imputation, and I support you in not using the mode to fill missing values. — Aiden Zhao, Jan 26 '19 at 08:19
It's a bit scary to fill target column with some median :) You should try fancyimpute or alice (R) or missforest (R) - something like this. They will fill missing values in a smarter way: considering other columns. Maybe it will help to keep class balance. — avchauzov, Jan 28 '19 at 07:42

score 1 · Answer 1 · answered Jan 26 '19 at 01:21

Just build a model on the other features to predict it, which should maintain your distribution. Since your missing data is categorical, it doesn't make sense to use the mean or median. Even if it were numeric, I would still advise against it: filling with the mean or median gives the distribution less variance, and so changes the distribution.
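A minimal sketch of that idea, assuming pandas and scikit-learn are available. The toy frame, the feature names `f1` and `f2`, and the choice of `RandomForestClassifier` are illustrative assumptions and not part of the question; only the `Complaint-Status` column name and the `''` marker for a missing label are taken from it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for the real frame.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
df["Complaint-Status"] = np.where(
    df["f1"] + 0.5 * df["f2"] > 1.0, "Closed", "Closed with explanation"
)

# Blank out 90 labels to mimic the missing target values ('' as in the question).
missing_idx = rng.choice(n, size=90, replace=False)
df.loc[missing_idx, "Complaint-Status"] = ""

target = "Complaint-Status"
features = ["f1", "f2"]  # assumed feature columns
is_missing = df[target] == ""

# Train only on rows whose label is known.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(df.loc[~is_missing, features], df.loc[~is_missing, target])

# Predict a label for every row whose target is missing.
df_filled = df.copy()
df_filled.loc[is_missing, target] = clf.predict(df.loc[is_missing, features])

# Compare class proportions: known labels vs. predicted labels.
print(df.loc[~is_missing, target].value_counts(normalize=True))
print(df_filled.loc[is_missing, target].value_counts(normalize=True))
```

Whether the predicted labels actually keep the class proportions depends on how well the features predict the target, so comparing the two printed distributions is a reasonable sanity check before using the filled column.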

Also, if you build a tree-based model, it will be able to handle missing data: decision tree, random forest, GBDT. See the lightgbm and xgboost packages.
