I am trying to implement a top-decile recall/precision scoring function to plug into GridSearchCV, but I cannot figure out what is wrong. I would like the scoring function to take the predicted probabilities, the actual labels and, ideally, the decile threshold as a percentage. It would rank the rows by score and compute the conversion rate within the decile threshold, e.g. the conversion rate of the top 10% of the population. That conversion rate would be the score I return; the higher the better. However, when I run the code below I do not get probability scores, and I do not understand what the input to the scoring function is. The print statement below outputs only 1s and 0s instead of probabilities.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases


def top_decile_conversion_rate(y_prob, y_actual):
    # Function goes in here
    print(y_prob, y_actual)
    return 0.5


features = pd.DataFrame({"f1": np.random.randint(1, 1000, 500),
                         "f2": np.random.randint(1, 1000, 500),
                         "label": [round(x) for x in np.random.random_sample(500)]})


my_scorer = make_scorer(top_decile_conversion_rate, greater_is_better=True)
gs = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [1, 2], 'class_weight': [None], 'penalty': ['l2']},
    cv=2,
    scoring=my_scorer)
model = gs.fit(features[["f1", "f2"]], features.label)
SriK
  • In make_scorer(), the scoring function should have the signature `(y_true, y_pred, **kwargs)`, which seems to be the opposite of yours. Also, what is your `top_decile_conversion_rate` returning? Please add these details. – Vivek Kumar Oct 05 '17 at 10:01
  • `top_decile_conversion_rate` would return a conversion rate, which is a number between 0 and 1. Yes, the signature is that, but I don't see the predictions being passed into that function. – SriK Oct 06 '17 at 06:49
  • What do you mean by "I don't see the predictions being passed into that function"? Can you please explain? The predictions will be passed internally to that function. Also, can you add the full `top_decile_conversion_rate` function so that I can debug it? – Vivek Kumar Oct 06 '17 at 07:18
  • Updated the question with the full code to debug. Notice that the print statements only print out 1s and 0s and never any prediction probabilities. – SriK Oct 06 '17 at 07:32
  • Just noticed the needs_proba parameter! It's all good now. Thanks @VivekKumar – SriK Oct 06 '17 at 07:38
  • Would you mind posting the `top_decile_conversion_rate`? I would like to take a look, as I currently face a similar challenge. – stats-hb Jun 03 '19 at 08:50
  • Ah, yeah, I didn't include that since it's not really part of the coding question here and is more of a data science question. Can you create a different question for the problem you have, and I can respond there? – SriK Jun 04 '19 at 18:20

1 Answer

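The solution is to pass needs_proba=True to make_scorer. The scorer then calls the estimator's predict_proba instead of predict, so the scoring function receives probabilities rather than hard 0/1 class labels. Note also that make_scorer calls the function as (y_true, y_pred), so the actual labels arrive as the first argument.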

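import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases


def top_decile_conversion_rate(y_actual, y_prob):
    # make_scorer calls score_func(y_true, y_pred), so the labels come first.
    # Function goes in here
    print("---prob--")
    print(y_prob)
    print("---actual--")
    print(y_actual)
    print("---end--")

    return 0.5


features = pd.DataFrame({"f1": np.random.randint(1, 1000, 500),
                         "f2": np.random.randint(1, 1000, 500),
                         "label": [round(x) for x in np.random.random_sample(500)]})


# needs_proba=True makes the scorer call predict_proba instead of predict.
# Newer scikit-learn releases use response_method="predict_proba" instead.
my_scorer = make_scorer(top_decile_conversion_rate, greater_is_better=True, needs_proba=True)
gs = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={'C': [1, 2], 'class_weight': [None], 'penalty': ['l2']},
    cv=20,
    scoring=my_scorer)
model = gs.fit(features[["f1", "f2"]], features.label)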
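
The placeholder body above could be filled in along the following lines. This is only a minimal sketch, not the original poster's implementation (which was never shared in this thread): the function name `top_fraction_conversion_rate` and the `top_fraction` parameter are hypothetical, and it assumes binary 0/1 labels. Depending on the scikit-learn version, the probabilities may arrive either as one value per row or as one column per class, so the sketch tolerates both.

```python
import math


def top_fraction_conversion_rate(y_actual, y_prob, top_fraction=0.1):
    """Share of positives among the highest-scored `top_fraction` of rows.

    Hypothetical helper for illustration; assumes binary 0/1 labels.
    """
    labels = list(y_actual)
    # If each row holds one probability per class, keep the last column
    # (the positive class); otherwise use the value as it is.
    scores = [row[-1] if hasattr(row, "__len__") else row for row in y_prob]
    if len(labels) != len(scores):
        raise ValueError("y_actual and y_prob must have the same length")
    if not labels:
        return 0.0
    # Always look at one row or more, even for tiny validation folds.
    n_top = max(1, math.ceil(len(labels) * top_fraction))
    # Row indices ordered from the highest score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Conversion rate = fraction of positives among the top-ranked rows.
    return sum(labels[i] for i in ranked[:n_top]) / n_top
```

Because make_scorer forwards extra keyword arguments to the scoring function, the threshold could then be set when building the scorer, for example make_scorer(top_fraction_conversion_rate, needs_proba=True, top_fraction=0.1).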
SriK
  • What about debugging with `print()` in a regression-type scorer? I tried it, but `print()` is not working. I want to show the y_pred and y_true values, but can't. – Ivan Lipko Jul 12 '19 at 16:00
  • for previous see [https://stackoverflow.com/questions/3807694/how-to-change-the-message-in-a-python-assertionerror] – Ivan Lipko Jul 12 '19 at 17:00
  • You would need to select the winning model from gridsearch and then call the predict function in order to get your predictions. Something like gs.best_estimator_.predict(X) – SriK Jul 19 '19 at 00:11
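
The suggestion in the last comment can be sketched as follows. This is an illustrative example under stated assumptions, not code from the thread: the data is synthetic, the parameter grid is arbitrary, and the variable names are made up. It relies on GridSearchCV's default refit=True, which refits the winning model on the full training data and exposes it as best_estimator_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only (not from the original post).
rng = np.random.RandomState(0)
X = rng.random_sample((500, 2))
y = rng.randint(0, 2, size=500)

gs = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={"C": [1, 2]},
    cv=2,
)
gs.fit(X, y)

# The winning model, refit on all of X because refit=True by default.
best = gs.best_estimator_
hard_labels = best.predict(X)                # predicted class labels
positive_prob = best.predict_proba(X)[:, 1]  # probability of class 1

print(hard_labels[:5])
print(positive_prob[:5])
```

Inspecting the predictions this way, after the search has finished, avoids depending on output printed from inside the scorer during cross-validation.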