I'm trying to quickly check how many items in a list fall below a series of thresholds, similar to what's described here but repeated many times over. The point of this is to do some diagnostics on a machine learning model that are a bit more in-depth than what's built into scikit-learn (ROC curves, etc.).
Imagine preds is a list of predictions (probabilities between 0 and 1). In reality, I will have over 1 million of them, which is why I'm trying to speed this up.
This creates some fake scores: normally distributed values, rescaled so they land between 0 and 1.
import numpy as np

fake_preds = [np.random.normal(0, 1) for i in range(1000)]
fake_preds = [(pred + np.abs(min(fake_preds))) / max(fake_preds + np.abs(min(fake_preds))) for pred in fake_preds]
Now, the way I am doing this is to loop through 99 threshold levels (0.01 through 0.99) and check how many predictions fall below each one:
thresholds = [round(n, 2) for n in np.arange(0.01, 1.0, 0.01)]
# comparing the list to a threshold relies on numpy turning it into a boolean array; sum() counts the Trues
thresh_cov = [sum(fake_preds < thresh) for thresh in thresholds]
This takes about 1.5 seconds for 10k predictions (less time than generating the fake ones), but you can imagine it takes a lot longer with far more predictions. And I have to do this a few thousand times to compare a bunch of different models.
Any thoughts on a way to make that second code block faster? I'm thinking there must be a way to order the predictions so it's easier to check the thresholds (similar to indexing in a SQL-like scenario), but I can't figure out any way other than sum(fake_preds < thresh)
to check them, and that doesn't take advantage of any indexing or ordering.
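For example, I was imagining something like the rough sketch below: sort the predictions once, then use binary search (np.searchsorted) to count how many sit below each threshold. I haven't verified that this gives exactly the same counts as my loop, so treat it as an idea rather than a working solution (preds_arr is just a name I made up for the sorted array):

import numpy as np

# sort once up front so each threshold lookup can be a binary search
preds_arr = np.sort(np.asarray(fake_preds))
thresholds = np.round(np.arange(0.01, 1.0, 0.01), 2)

# for each threshold, searchsorted returns the insertion index in the sorted
# array, which should equal the number of predictions strictly below it
thresh_cov = np.searchsorted(preds_arr, thresholds, side='left')

Is something along those lines the right direction, or is there a better way entirely?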
Thanks in advance for the help!