
I'm new to this, but I'd like to plot a ROC curve for a small dataset of active compounds versus decoys. I based my code on this link: ROC curve for binary classification in python. In this case, the small dataset is the result of a virtual screening that ranked and scored compounds whose activity or inactivity is known from experimental data (IC50).

I'm not sure the plot and the AUC are correct. I noticed that even when the predicted values differed from the true values in only one position, the AUC was only 0.5. For the true and predicted values in the code I inserted below, it was only around 0.49. Perhaps the model was not properly identifying the compounds. However, I noticed that it identified the first ten compounds in the rank correctly, as well as some in other positions. Maybe it identified active compounds better than inactive ones, or maybe that was simply because there were more active compounds to consider. Also, would it be better to use a classification scheme other than binary for the tested and predicted values? For example, ranking the IC50 values from best to worst and comparing that with the virtual screening rank, creating a score for the true and predicted results based on the similarity between each compound's two ranks (IC50 and virtual screening), as in the sketch below.
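Something like this is what I had in mind for comparing the two rankings, using Spearman's rank correlation (just a sketch with made-up IC50 and score values; the variable names are mine):

from scipy.stats import spearmanr

# Hypothetical example: lower IC50 means a more active compound,
# higher virtual screening score means a better-ranked compound
ic50 = [0.5, 3.4, 8.9, 12.1, 75.2, 250.0]    # experimental IC50 values
vs_score = [9.1, 8.5, 8.0, 7.8, 6.9, 5.2]    # virtual screening scores

# Negate IC50 so that "better" points in the same direction for both
# lists, then compare the two rankings with Spearman's rho
rho, p_value = spearmanr([-x for x in ic50], vs_score)
print(rho, p_value)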

I also thought about plotting a precision-recall curve, considering the imbalance between the number of active compounds and decoys; a sketch of what I had in mind follows.
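This sketch assumes continuous screening scores (in a variable I'm calling scores) rather than the 0/1 predictions below; the score values here are placeholders:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

test = [1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1]
# Placeholder continuous scores, one per compound
scores = [0.9, 0.8, 0.85, 0.7, 0.75, 0.95, 0.99, 0.97, 0.6, 0.96,
          0.92, 0.55, 0.88, 0.65, 0.9, 0.93, 0.5, 0.98, 0.1, 0.8,
          0.4, 0.3, 0.25, 0.35]

precision, recall, _ = precision_recall_curve(test, scores)
print(average_precision_score(test, scores))

plt.figure()
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.show()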

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score
test = [1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1]
pred = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0]
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
    fpr[i], tpr[i], _ = roc_curve(test, pred)
    roc_auc[i] = auc(fpr[i], tpr[i])

print(roc_auc_score(test, pred))
plt.figure()
plt.plot(fpr[1], tpr[1])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.show()
    What is your question exactly? – Calimo Mar 17 '21 at 13:49
  • I don't know if the code is generating the ROC curve correctly, or if there is a better way to code for this. – Camila Fonseca Amorim da Silva Mar 18 '21 at 15:03
  • Then please edit your question to reflect that. At the moment it looks more like a rant about your data than a question. – Calimo Mar 18 '21 at 15:36
  • I added the main question to the title; it's the issue I mention at the beginning of the second paragraph. Maybe it was not clear because I didn't include a question mark, as I did for the other two questions in the same paragraph. It might have seemed like a rant because I chose to include more details (observations/thoughts) about the situation along with the questions/concerns. – Camila Fonseca Amorim da Silva Mar 19 '21 at 16:40

1 Answer


The code required to plot the ROC curve is very similar to yours, but simpler. There is no need to store fpr and tpr in dictionaries; they are plain arrays. I think the problem is that your predictions are hard True/False values rather than probabilities, so the roc_curve function cannot use them to generate threshold values. I changed the pred values to probabilities (> 0.5 meaning True, < 0.5 meaning False), and the curve now looks closer to what you probably expect. Also, only about 66% of your original predictions are correct, which keeps the curve relatively close to the 'no-discrimination' line (a random classifier, 50% probability).

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, roc_auc_score

test = [1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1]
# Probability-like scores instead of hard 0/1 calls
pred = [0.91,0.87,0.9,0.75,0.85,0.97,0.99,0.98,0.66,0.97,0.98,0.57,0.89,0.62,0.93,0.97,0.55,0.99,0.11,0.84,0.45,0.35,0.3,0.39]

# fpr and tpr are plain arrays; no dictionary needed
fpr, tpr, _ = roc_curve(test, pred)
roc_auc = auc(fpr, tpr)

print(roc_auc_score(test, pred))
plt.figure()
plt.plot(fpr, tpr)
# Dashed diagonal: the 'no-discrimination' (random classifier) line
plt.plot([0.0, 1.0], [0.0, 1.0], ls='--', lw=0.3, c='k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.show()

Now the AUC value is 0.5842105263157894.
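To see why the original 0/1 predictions flatten the curve, you can inspect the thresholds roc_curve generates in each case (a quick check reusing test and pred from above; binary_pred is the prediction list from your question):

# With hard 0/1 predictions there are only two distinct score values,
# so roc_curve can place just one operating point between the corners
binary_pred = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0]
fpr_b, tpr_b, thresholds_b = roc_curve(test, binary_pred)
print(thresholds_b)   # only a handful of thresholds

# With probability scores, every distinct score is a candidate
# threshold, so the curve has many more points to trace
fpr_p, tpr_p, thresholds_p = roc_curve(test, pred)
print(thresholds_p)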

[Plot of the ROC curve from the code above]

Carlos Melus