
I have a list of floats I get from a machine learning algorithm. All these floats are between 0 and 1:

probs = [proba[0] for proba in self.classifier.predict_proba(x_test)]

probs is my list of floats. The predict_proba() function normally returns a numpy array. It takes about 9 seconds to get the list, which ends up containing about 60k values.

I would like to scale, or normalize, all the values in the list against the highest value in the list.

Normally, I would do it like this:

maximum = max(probs)
list_values = [proba / maximum for proba in probs]

But for 60k values, it takes about 2 minutes. I would like to make it faster.

Do you have any idea how I could achieve better performance?

JPFrancoia

4 Answers


If you don't mind using an external library, numpy might be worth looking into:

import numpy
probs = numpy.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probs.max()
list_values = probs / maximum
user3636636
I'm not sure if it is observable in a scripting language like Python, but generally division is much slower than multiplication, so it is better to multiply the vector by (1.0 / maximum) instead of dividing. – stgatilov Jul 19 '15 at 18:22
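A minimal sketch of the multiply-by-reciprocal variant the comment suggests, reusing the probs array built in the answer above:

inv_max = 1.0 / probs.max()    # compute the reciprocal once
list_values = probs * inv_max  # one multiplication per element instead of a division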

Another approach using numpy, potentially faster if your list of probabilities is large, is to convert all of your probabilities to a numpy array at once and then operate on it:

import numpy as np

probs = np.asarray(self.classifier.predict_proba(x_test))

list_values = probs[:, 0] / probs[:, 0].max()

The first line will convert all your probabilities to an N x M array (where N is the number of samples and M the number of classes).

The second line will select all the probabilities for the first class ([:, 0] means all rows of column 0, which yields a vector of size N) and divide it by the maximum of that column.

You can potentially extend this to all your probabilities:

all_probs = probs / probs.max()

The above will normalize all your probabilities for all the classes, and later you can access them as all_probs[:, i], where i is the class of interest, for instance:
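A hypothetical usage (the class index 1 here is just an illustration):

class_of_interest = 1
scaled = all_probs[:, class_of_interest]  # normalized probabilities for that class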

Imanol Luengo

You should use scikit-learn's normalize:

from sklearn.preprocessing import normalize
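A minimal sketch of how that might look for this use case; note that normalize expects a 2-D array, and norm='max' scales each row by its maximum (the reshape step and the variable names here are illustrative assumptions, not part of the original answer):

import numpy as np
from sklearn.preprocessing import normalize

# normalize() works on 2-D input; treat the whole list as a single row
probs_2d = np.asarray(probs).reshape(1, -1)
# norm='max' divides each row by its maximum value
list_values = normalize(probs_2d, norm='max')[0]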
Geeocode

If you want your end result to be a numpy.array, then it would be faster to convert your list to a numpy array beforehand and use array division directly, rather than a list comprehension. Example -

import numpy as np
probsnp = np.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probsnp.max()
list_values = probsnp / maximum

Examples of timing tests -

In [46]: import numpy.random as ndr

In [47]: probs = ndr.random_sample(1000)

In [48]: probs.shape
Out[48]: (1000,)

In [49]: def func1(probs):
   ....:     maximum = max(probs)
   ....:     probsnew = [i/maximum for i in probs]
   ....:     return probsnew
   ....:

In [50]: def func2(probs):
   ....:     maximum = probs.max()
   ....:     probsnew = probs/maximum
   ....:     return probsnew
   ....:

In [51]: %timeit func1(probs)
The slowest run took 229.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 279 µs per loop

In [52]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop

In [53]: %timeit func2(probs)
The slowest run took 356.45 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 81 µs per loop

In [54]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop

In [55]: %timeit func2(probs)
10000 loops, best of 3: 81.5 µs per loop

The numpy method takes only about a third of the time of the list comprehension.


Timing tests with the numpy.array() conversion included as part of func2 (in the above example) -

In [60]: probslist = [p for p in probs]

In [61]: def func2(probs):
   ....:     probsnp = np.array(probs)
   ....:     maxprobs = probsnp.max()
   ....:     probsnew = probsnp/maxprobs
   ....:     return probsnew
   ....:

In [65]: %timeit func1(probslist)
1000 loops, best of 3: 212 µs per loop

In [66]: %timeit func2(probslist)
10000 loops, best of 3: 198 µs per loop

In [67]: probs = ndr.random_sample(60000)

In [68]: probslist = [p for p in probs]

In [74]: %timeit func1(probslist)
100 loops, best of 3: 11.5 ms per loop

In [75]: %timeit func2(probslist)
100 loops, best of 3: 5.79 ms per loop

In [76]: %timeit func1(probslist)
100 loops, best of 3: 11.4 ms per loop

In [77]: %timeit func2(probslist)
100 loops, best of 3: 5.81 ms per loop

Seems like it's still a little faster to use a numpy array.

Anand S Kumar