
In a machine learning project, I have training data about a company's clients. It includes 20 input features and a label representing each client's response to a marketing campaign in the form of a Yes/No answer:

c1 => {f1_1,f2_1,...,f20_1} {Yes}

c2 => {f1_2,f2_2,...,f20_2} {No}

The requirement is to predict each client's 'acceptance probability' for the campaign.

So, the training data has a binary classification label, while the requirement is a regression prediction.

I have computed the correlation of each feature with the classification label.
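For example, such correlations can be computed with pandas along the following lines (a sketch on synthetic stand-in data; the dataframe and column names are hypothetical):

    import numpy as np
    import pandas as pd

    # hypothetical stand-in for the client data: 20 features f1..f20 plus a Yes/No label
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 20)),
                      columns=[f"f{i}" for i in range(1, 21)])
    df["label"] = rng.choice(["Yes", "No"], size=100)

    # encode Yes/No as 1/0 and correlate every feature with it
    # (Pearson correlation with a dichotomous variable = point-biserial correlation)
    label_bin = (df["label"] == "Yes").astype(int)
    correlations = df.drop(columns="label").corrwith(label_bin)
    print(correlations.sort_values(key=abs, ascending=False))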

Does it make sense to assign each feature an importance weight based on the strength of its correlation with the classification label, apply those weights to the feature values to produce a score for each client, and then use those scores as the regression label?

c1_score = w1*f1_1 + w2*f2_1 + ... + w20*f20_1

c2_score = w1*f1_2 + w2*f2_2 + ... + w20*f20_2
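In code, the proposed scoring amounts to a weighted sum per client, e.g. this NumPy sketch (X and w are hypothetical stand-ins for the feature matrix and the correlation-based weights):

    import numpy as np

    # hypothetical stand-ins: X is the (n_clients, 20) feature matrix and w the
    # importance weights derived from each feature's correlation with the label
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2, 20))    # e.g. clients c1 and c2
    w = rng.uniform(size=20)

    scores = X @ w                  # c_score = w1*f1 + w2*f2 + ... + w20*f20
    print(scores)                   # one 'scoring rate' per client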

If not, is there any other suggestion?


1 Answer


The requirement is to predict each client's 'acceptance probability' for the campaign.

So, the training data has a binary classification label, while the requirement is a regression prediction.

Most certainly not.

Your task is definitely a classification one.

Most classifiers out there do not actually produce a "hard" 0/1 label as output; what they produce by default are probabilities, which are subsequently converted into hard labels via a thresholding operation (e.g. if the probability p > 0.5, declare 1, otherwise declare 0).
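For illustration, here is that thresholding step in code (a minimal sketch on made-up probabilities):

    import numpy as np

    # hypothetical predicted probabilities for five clients
    proba = np.array([0.91, 0.15, 0.55, 0.42, 0.78])

    # the thresholding operation: if p > 0.5 declare 1, otherwise declare 0
    hard_labels = (proba > 0.5).astype(int)
    print(hard_labels)              # [1 0 1 0 1]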

Now, sometimes the business problem, for whatever reason, requires exactly these probabilities instead of the hard labels (your case is one of them, as is the vast majority of classification contests on Kaggle). This changes nothing in the methodology: it is still a classification problem, except that the final thresholding operation is no longer required. That operation is not part of the statistical side of the problem anyway, as the answer to this Cross Validated thread correctly points out:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

So, you need do nothing more than employ your usual classification algorithm of choice, be it logistic regression, random forest, etc., and use the respective method to get back probabilities instead of class labels (e.g. the predict_proba method for logistic regression in scikit-learn, and similarly for other platforms/algorithms).
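A minimal scikit-learn sketch of the whole idea (the data here is synthetic, and the variable names are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # synthetic stand-in for the 20-feature client data (Yes -> 1, No -> 0)
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # predict_proba returns one column per class; the column for class 1
    # ("Yes") is exactly the requested acceptance probability
    acceptance_proba = clf.predict_proba(X_new)[:, 1]
    print(acceptance_proba[:5])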

