
I have a list of floats I get from a machine learning algorithm. All these floats are between 0 and 1:

probs = [proba[0] for proba in self.classifier.predict_proba(x_test)]

probs is my list of floats. The predict_proba() function normally returns a numpy array. It takes about 9 seconds to get the list, which ends up containing about 60k values.

I would like to scale, or normalize, all the values in the list against the highest value in the list.

Normally, I would do it like this:

maximum = max(probs)
list_values = [proba / maximum for proba in probs]

But for 60k values, it takes about 2 minutes. I would like to make it faster.

Do you have any idea how I could achieve better performance?

JPFrancoia

4 Answers


If you don't mind using an external library, numpy might be worth looking into:

import numpy
probs = numpy.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probs.max()
list_values = probs / maximum
user3636636
I'm not sure if it is observable in a scripting language like Python, but generally division is much slower than multiplication, so it is better to multiply the vector by (1.0 / maximum) instead of dividing. – stgatilov Jul 19 '15 at 18:22
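A minimal sketch of the multiply-by-reciprocal variant the comment suggests, reusing the probs array built in the answer above:

inv_max = 1.0 / probs.max()    # compute the reciprocal once
list_values = probs * inv_max  # one multiplication per element instead of a division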

Another approach using numpy, potentially faster if your list of probabilities is large, is to convert all of your probabilities to a numpy array at once and then operate on it:

import numpy as np

probs = np.asarray(self.classifier.predict_proba(x_test))

list_values = probs[:, 0] / probs[:, 0].max()

The first line will convert all your probabilities to an N x M array (where N is the number of samples and M the number of classes).

The second line will select all the probabilities for the first class ([:, 0] means all rows of column 0, which yields a vector of size N) and divide it by the maximum of that column.

You can potentially extend this to all your probabilities:

all_probs = probs / probs.max()

The above will normalize all your probabilities for all the classes, and later you can access them as all_probs[:, i], where i is the class of interest, for instance:
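A hypothetical usage (the class index 1 here is just an illustration):

class_of_interest = 1
scaled = all_probs[:, class_of_interest]  # normalized probabilities for that class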

Imanol Luengo

You should use scikit-learn's normalize:

from sklearn.preprocessing import normalize
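A minimal sketch of how that might look for this use case; note that normalize expects a 2-D array, and norm='max' scales each row by its maximum (the reshape step and the variable names here are illustrative assumptions, not part of the original answer):

import numpy as np
from sklearn.preprocessing import normalize

# normalize() works on 2-D input; treat the whole list as a single row
probs_2d = np.asarray(probs).reshape(1, -1)
# norm='max' divides each row by its maximum value
list_values = normalize(probs_2d, norm='max')[0]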
Geeocode

If you want your end result to be a numpy.array, then it would be faster to convert your list to a numpy array beforehand and use array division directly, rather than a list comprehension. Example -

import numpy as np
probsnp = np.array([proba[0] for proba in self.classifier.predict_proba(x_test)])
maximum = probsnp.max()
list_values = probsnp / maximum

Examples of timing tests -

In [46]: import numpy.random as ndr

In [47]: probs = ndr.random_sample(1000)

In [48]: probs.shape
Out[48]: (1000,)

In [49]: def func1(probs):
   ....:     maximum = max(probs)
   ....:     probsnew = [i/maximum for i in probs]
   ....:     return probsnew
   ....:

In [50]: def func2(probs):
   ....:     maximum = probs.max()
   ....:     probsnew = probs/maximum
   ....:     return probsnew
   ....:

In [51]: %timeit func1(probs)
The slowest run took 229.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 279 µs per loop

In [52]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop

In [53]: %timeit func2(probs)
The slowest run took 356.45 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 81 µs per loop

In [54]: %timeit func1(probs)
1000 loops, best of 3: 278 µs per loop

In [55]: %timeit func2(probs)
10000 loops, best of 3: 81.5 µs per loop

The numpy method takes only about a third of the time of the list comprehension.


Timing tests with the numpy.array() conversion included as part of func2 (in the above example) -

In [60]: probslist = [p for p in probs]

In [61]: def func2(probs):
   ....:     probsnp = np.array(probs)
   ....:     maxprobs = probsnp.max()
   ....:     probsnew = probsnp/maxprobs
   ....:     return probsnew
   ....:

In [65]: %timeit func1(probslist)
1000 loops, best of 3: 212 µs per loop

In [66]: %timeit func2(probslist)
10000 loops, best of 3: 198 µs per loop

In [67]: probs = ndr.random_sample(60000)

In [68]: probslist = [p for p in probs]

In [74]: %timeit func1(probslist)
100 loops, best of 3: 11.5 ms per loop

In [75]: %timeit func2(probslist)
100 loops, best of 3: 5.79 ms per loop

In [76]: %timeit func1(probslist)
100 loops, best of 3: 11.4 ms per loop

In [77]: %timeit func2(probslist)
100 loops, best of 3: 5.81 ms per loop

Seems like it's still a little faster to use a numpy array.

Anand S Kumar