Implementing the algorithm to maximize number of true positives
I would not recommend doing this (see the discussion at the end), but from what I understand you want to maximize the number of true positives. To do that, you can create a custom scorer and let TPOT optimize the true positive rate. I simplified your function, since it depends on a given number k; that dependency can be avoided by simply calculating the true positive rate. I used an example dataset from sklearn, which can of course be replaced with any other.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

def maximize_true_pos(y, y_pred):
    # mark true positives with ones, all other samples with zeros
    true_pos = np.where((y == 1) & (y_pred == 1), 1, 0)
    # count the true positives
    num_true_pos = np.sum(true_pos)
    # true positive rate: how many of the actual positives were found?
    true_pos_div_total_tp = num_true_pos / np.sum(y)
    return true_pos_div_total_tp

data = load_breast_cancer()
# create the custom scorer
max_true_pos_scorer = make_scorer(maximize_true_pos)
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

tpot = TPOTClassifier(verbosity=2, max_time_mins=2, scoring=max_true_pos_scorer)
tpot.fit(X_train, y_train)
y_pred = tpot.predict(X_test)
Discussion of results and methodology
Now let's understand what was optimized here by looking at y_pred:
y_pred
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Since we only asked it to maximize the number of true positives, the algorithm learned that false positives are not penalized and therefore predicted class 1 for everything (even though y_true is not always 1, which is why accuracy < 1). Depending on your use case, recall (how many of the actual positives are found) or precision (how many of the predicted positives are actually positive) is a better metric than simply teaching the algorithm to label everything as positive.
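To see this numerically, you can compare the custom scorer against standard sklearn metrics on the test set (a small sketch building on the variables above; the exact values depend on your split):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# recall is perfect by construction, because every sample is predicted positive
print(recall_score(y_test, y_pred))
# precision and accuracy fall back to the share of actual positives in the test set
print(precision_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))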
To use precision or recall (you probably know that, but I still put it in here for the sake of completeness), one can simply pass "precision" or "recall" as the scoring argument in the following fashion:

TPOTClassifier(scoring='recall')
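If you still want to emphasize finding positives without collapsing into the all-positive solution, one option (a suggestion on my part, not something TPOT requires) is an F-beta scorer with beta > 1, which weights recall more heavily than precision while still penalizing false positives:

from sklearn.metrics import fbeta_score, make_scorer

# beta=2 counts recall roughly twice as much as precision; tune beta as needed
f2_scorer = make_scorer(fbeta_score, beta=2)
tpot = TPOTClassifier(verbosity=2, max_time_mins=2, scoring=f2_scorer)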