I've trained a model and identified a 'threshold' that I'd like to deploy it at, but I'm having trouble understanding how the threshold relates to the score.
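import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve

X = labeled_data[features].reset_index(drop=True)
Y = np.array(labeled_data['fraud'].reset_index(drop=True))
# (train/test etc.. settle on an acceptable model)
# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1
grad_des = SGDClassifier(alpha=alpha_optimum, l1_ratio=l1_optimum, loss='log')
grad_des.fit(X, Y)
score_Y = grad_des.predict_proba(X)  # column 1 is the probability of the positive class
precision, recall, thresholds = precision_recall_curve(Y, score_Y[:,1])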
Alright, so now I plot precision and recall vs. threshold and decide I want my threshold to be 0.4.
What is the threshold?
My model coefficients, which I understand 'score' events by computing coefficients['x']*event_values['x'], sum to 29. The threshold is between 0 and 1.
How am I to understand the translation from the threshold to what is, I guess, a raw score? Would an event with a 1 for all features (all are binary) have a calculated score of 29, since that is the sum of all coefficients?
Do I need to compute this 'raw' score metric for all events and then plot that against precision instead of threshold?
Edit and Update:
So my question hinged on a lack of understanding of the logistic function, as Mikhail Korobov pointed out below. Regardless of the 'raw score', the logistic function forces the value into the [0, 1] range.
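To make that concrete, here is a minimal sketch of the logistic function using only the standard library. The raw scores are made-up illustration values (29 is included only because it is the coefficient sum mentioned above), not output from my model.

```python
import math

def logistic(raw_score):
    """Map a raw linear score (log-odds) to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-raw_score))

# Hypothetical raw scores, chosen only for illustration
for raw in (-5.0, 0.0, 2.0, 29.0):
    print(raw, logistic(raw))
```

A raw score of 0 maps to exactly 0.5, and larger raw scores get squeezed ever closer to 1, which is why a threshold on the probability always lies between 0 and 1 no matter how large the coefficients are.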
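To 'unwrap' that value back into the 'raw score' I was looking for, I can compute scipy.special.logit(0.8) - grad_des.intercept_ (with 0.8 standing in for a row's predicted probability), and this returns the 'score' of the row, i.e. the sum of coefficients['x']*event_values['x'].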
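Here is a rough sketch of that round trip, again with only the standard library so it runs on its own. The coefficients, intercept, and event are hypothetical stand-ins for grad_des.coef_, grad_des.intercept_, and one row of X; the logit helper is meant to mirror what scipy.special.logit computes.

```python
import math

def logistic(z):
    """Raw score (log-odds) -> probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Probability -> raw score (log-odds); inverse of logistic."""
    return math.log(p / (1.0 - p))

# Hypothetical model parameters and one binary-feature event (not my real model)
coefficients = {"a": 1.5, "b": -0.7, "c": 2.0}
intercept = -1.2
event = {"a": 1, "b": 1, "c": 0}

# The 'raw score': sum of coefficient * feature value
weighted_sum = sum(coefficients[k] * event[k] for k in coefficients)

# What predict_proba would report for the positive class under this model
probability = logistic(weighted_sum + intercept)

# Unwrapping the probability recovers the weighted sum
recovered = logit(probability) - intercept

# The same trick converts a probability threshold into a raw-score cutoff
threshold = 0.4
raw_cutoff = logit(threshold) - intercept

flagged_by_probability = probability >= threshold
flagged_by_raw_score = weighted_sum >= raw_cutoff
print(weighted_sum, recovered, raw_cutoff)
print(flagged_by_probability, flagged_by_raw_score)
```

Because the logistic function is monotonic, comparing the probability against 0.4 and comparing the weighted sum against the converted cutoff give the same decision, so there should be no need to re-plot precision against the raw score.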