
I have been working on a churn prediction use case in Python using XGBoost. The model is trained on features such as Age, Tenure, last 6 months' income, etc., and predicts whether an employee is likely to leave, given their employee ID. Additionally, if the user wants to see why the ML system categorised an employee this way, they can view the features that contributed to the prediction, which are extracted from the model via the eli5 library. To make this more explainable to the users, we created ranges for each feature:

Tenure (in days)
[0-100]   = High Risk
[101-300] = Medium Risk
[301-800] = Low Risk
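
For context, applying these hand-defined ranges in code currently looks roughly like the sketch below (the tenure_days column and the sample values are made up for illustration):

import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({'tenure_days': [45, 250, 780]})

# Hand-defined boundaries; the upper end of "Low Risk" is open-ended
bins = [0, 100, 300, float('inf')]
labels = ['High Risk', 'Medium Risk', 'Low Risk']

df['tenure_risk'] = pd.cut(df['tenure_days'], bins=bins,
                           labels=labels, include_lowest=True)
print(df)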

To define these ranges, we analysed the distribution of each feature and manually defined the ranges for use in the system. We looked at the impact of each feature on the target variable IsTerminated in the training data. Below is an example of the Tenure distribution.

[Image: histogram of the Tenure distribution, split by IsTerminated]

Here the green bars represent the employees who were terminated or left, and the pink bars represent those who stayed.
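
For reference, this kind of plot can be produced with matplotlib along these lines (a sketch on randomly generated data; employees, Tenure and IsTerminated stand in for our actual training frame):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in for the training data
rng = np.random.default_rng(42)
employees = pd.DataFrame({
    'Tenure': rng.integers(0, 800, size=1000),
    'IsTerminated': rng.integers(0, 2, size=1000),
})

# Overlay the Tenure histograms of leavers vs. stayers
plt.hist(employees.loc[employees['IsTerminated'] == 1, 'Tenure'],
         bins=30, alpha=0.5, color='green', label='Terminated / left')
plt.hist(employees.loc[employees['IsTerminated'] == 0, 'Tenure'],
         bins=30, alpha=0.5, color='pink', label='Stayed')
plt.xlabel('Tenure (days)')
plt.ylabel('Number of employees')
plt.legend()
plt.show()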

So the question is: as time passes and new data is added to the model, such features' risk ranges will change. In the case of Tenure, if an employee has a tenure of 780 days, a month later the tenure feature will show 810. Obviously, we keep the upper end of "Low Risk" open-ended. But the real problem is: how can we define the internal boundaries / ranges programmatically?

Gursharan Singh

1 Answer


EDIT: Thanks for the clarification. I have changed the answer.

It is important to realize that you are trying to project a selection in a multi-dimensional space onto a 1D space. You will not always be able to see a clear separation like the one you got. There are also various ways to do this; here I made a simple example that could help your client interpret the model, although of course it does not represent the full complexity of the model.

You did not provide any sample data, so I will use the breast cancer dataset from scikit-learn instead.

First let's import what we need:

from sklearn import datasets
from xgboost import XGBClassifier
import pandas as pd
import numpy as np

Now load the dataset and train a very simple XGBoost model:

cancer = datasets.load_breast_cancer()

X = cancer.data
y = cancer.target

xgb_model = XGBClassifier(n_estimators=5,
                          objective="binary:logistic", 
                          random_state=42)
xgb_model.fit(X, y)

There are multiple ways to solve this.

One approach is to bin on the probability given by the model: you decide which probabilities you consider to be "High Risk", "Medium Risk" and "Low Risk", and then classify the intervals of the data accordingly. In this example I considered 0 <= p <= 0.5 to be low, 0.5 < p <= 0.8 to be medium, and 0.8 < p <= 1 to be high.

First you have to calculate the probability for each prediction. I would suggest using a test set for this, to avoid bias from possible overfitting of the model.

y_prob = pd.DataFrame(xgb_model.predict_proba(X))[0]
df = pd.DataFrame(X, columns=cancer.feature_names)
# Stores the probability of a malignant cancer
df['probability'] = y_prob
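
If you want to follow the test-set suggestion, the split could look like this (a sketch; note that for brevity the rest of this answer keeps using the full X):

from sklearn.model_selection import train_test_split

# Hold out part of the data so the probabilities are not biased by overfitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

xgb_model.fit(X_train, y_train)
# Probability of class 0 (malignant) on unseen data
y_prob_test = pd.DataFrame(xgb_model.predict_proba(X_test))[0]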

Then you have to bin your data and calculate the average probability in each of those bins. I would suggest binning your data with the automatic bin calculation of np.histogram_bin_edges:

def calculate_mean_prob(feat):
    """Calculates mean probability for a feature value, binning it."""
    # Bins from the automatic rules from numpy, check docs for details
    bins = np.histogram_bin_edges(df[feat], bins='auto')
    binned_values = pd.cut(df[feat], bins)
    return df['probability'].groupby(binned_values).mean()
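
For instance, calling this for a single feature returns a Series of mean probabilities indexed by the automatically chosen bins (the exact values depend on the fitted model):

mean_prob = calculate_mean_prob('worst radius')
print(mean_prob.head())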

Now you can classify each bin following what you would consider to be a low/medium/high probability:

def classify_probability(prob, medium=0.5, high=0.8, fillna_method='ffill'):
    """Classify the output of each bin into a risk group,
       according to the probability.

    Following these rules:
    0 <= p <= medium: Low Risk
    medium < p <= high: Medium Risk
    high < p <= 1: High Risk

    If a bin has no entries, it will be filled in the direction
    given by fillna_method ('ffill' or 'bfill').
    """
    risk = pd.cut(prob, [0., medium, high, 1.0], include_lowest=True,
                  labels=['Low Risk', 'Medium Risk', 'High Risk'])

    # Series.fillna(method=...) is deprecated in recent pandas,
    # so use ffill/bfill directly
    risk = risk.ffill() if fillna_method == 'ffill' else risk.bfill()

    return risk

This returns the risk for each of the bins into which you divided your data. Since you will probably have multiple consecutive bins with the same label, you might want to merge those consecutive pd.Interval bins. The code for that is shown below:

def sum_interval(i1, i2):
    """Merges two pd.Intervals into one if they are consecutive;
       returns None otherwise (or if i2 is None)."""
    if i2 is None:
        return None
    if i1.right == i2.left:
        return pd.Interval(i1.left, i2.right)
    return None

def sum_intervals(args):
    """Given a list of pd.Intervals, 
       returns a list summing consecutive intervals."""
    result = list()
    current_interval = args[0]
    
    for next_interval in list(args[1:]) + [None]:
        # Try to merge the current interval and the next interval.
        # The None is necessary to flush the last interval.
        sum_int = sum_interval(current_interval, next_interval)

        if sum_int is not None:
            # Update the current_interval when merging is possible
            current_interval = sum_int
        else:
            # Otherwise store the finished interval and start a new one
            result.append(current_interval)
            current_interval = next_interval
    if len(result) == 1:
        return result[0]
    
    return result

def combine_bins(df):
    # Group them by label
    grouped = df.groupby(df).apply(lambda x: sorted(list(x.index)))
    # Sum each category in intervals, if consecutive
    merged_intervals = grouped.apply(sum_intervals)
    return merged_intervals
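
As a quick sanity check of the merging logic, with made-up intervals, two consecutive intervals collapse into one while a gap keeps them apart:

intervals = [pd.Interval(0.0, 1.0), pd.Interval(1.0, 2.0), pd.Interval(3.0, 4.0)]
print(sum_intervals(intervals))
# [Interval(0.0, 2.0, closed='right'), Interval(3.0, 4.0, closed='right')]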

Now you can combine all the functions to calculate the bins for each feature:

def generate_risk_class(feature, medium=0.5, high=0.8):
    mean_prob = calculate_mean_prob(feature)
    classification = classify_probability(mean_prob, medium=medium, high=high)
    merged_bins = combine_bins(classification)
    return merged_bins

For example, generate_risk_class('worst radius') results in:

Low Risk          (7.93, 17.3]
Medium Risk     (17.3, 18.639]
High Risk      (18.639, 36.04]

But for features which are not such good discriminators (or which do not separate the high/low risk linearly), you will get more complicated regions. For example, generate_risk_class('mean symmetry') results in:

Low Risk       [(0.114, 0.209], (0.241, 0.249], (0.272, 0.288]]
Medium Risk    [(0.209, 0.225], (0.233, 0.241], (0.249, 0.264]]
High Risk      [(0.225, 0.233], (0.264, 0.272], (0.288, 0.304]]
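
Since both the bins and the probabilities are derived from the data and the model, you can regenerate the ranges for every feature whenever you retrain, rather than defining them by hand. A sketch (the try/except guards features that never populate one of the risk classes with the chosen thresholds):

# Recompute the risk ranges for all features after (re)training
risk_classes = {}
for feat in cancer.feature_names:
    try:
        risk_classes[feat] = generate_risk_class(feat)
    except (IndexError, ValueError):
        # Skip features where some risk class ends up with no bins
        continue
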
hzanoli
  • Thank you for your input. I know about binning a continuous variable, which is what we've done until now. What I want is to eliminate the need to define the boundaries manually. In the case of your example, I want to extract the "bins" list from the distribution itself – Gursharan Singh Jun 24 '20 at 10:05
  • I have changed the answer to an example of how to determine the bins. I hope it helps! – hzanoli Jun 24 '20 at 14:35
  • Thanks hzanoli. This solution seems legit to me. Let me try this out and get back if I have any other query. – Gursharan Singh Jun 25 '20 at 10:28
  • Now, the problem here is that I am working with 10 features, and for each feature I first have to choose the probability ranges that indicate the High, Medium and Low risk categories. Secondly, these probabilities are subject to changes in the data. In your example you chose the breast cancer dataset, which is pretty much static, but for continuously growing data in production I think this will fail. What's your opinion? – Gursharan Singh Jul 02 '20 at 05:15