
I want to perform a mathematical integration in python in the following way:

[1] Solve an implicit equation with the help of scipy.optimize.fsolve to find the maximum position of the integrand

[2] Shift the integrand's maximum to zero and integrate from -10 to 10 with the help of scipy.integrate.quad

Since the integrand has one free parameter xi, I want to perform this with a range of xi values with the help of numpy, so I use numpy.vectorize.

Now there are two ways of vectorizing this algorithm:

[A] vectorize [1] and [2] separately and give the result of vec_[1] to vec_[2] as input

[B] vectorize a function that performs [1] and [2] at once

I noticed that [A] is much faster than [B]. Here is the code:

from scipy import optimize, integrate
import numpy as np
from math import exp, sqrt, pi
import time

def integral_with_fsolve(xi):
    xc = optimize.fsolve(lambda x: x+1./(1+exp(-x))-xi,0.)
    def integrand(x,xi):
        return exp(-(x-xi+xc)**2)/(2.*sqrt(2.*pi))/(1.+exp(x+xc))
    integral = integrate.quad(integrand,-10.,10.,args=(xi,),epsabs=0.)
    return integral[0]

def integral(xi,xc):
    def integrand(x,xi):
        return exp(-(x-xi+xc)**2)/(2.*sqrt(2.*pi))/(1.+exp(x+xc))
    integral = integrate.quad(integrand,-10.,10.,args=(xi,),epsabs=0.)
    return integral[0]

def fsolve(xi):
    return optimize.fsolve(lambda x: x+1./(1+exp(-x))-xi,0.)

vec_integral_with_fsolve = np.vectorize(integral_with_fsolve)
vec_integral = np.vectorize(integral)
vec_fsolve = np.vectorize(fsolve)

xi = np.linspace(0.,2.,1000)

t0=time.time()
something = vec_integral_with_fsolve(xi)
dur=(time.time()-t0)
speed = xi.size/dur
print('Integrate and fsolve vectorized in one: speed = {} ints/sec'.format(speed))

t0=time.time()
xc = vec_fsolve(xi)
something = vec_integral(xi,xc)
dur=(time.time()-t0)
speed = xi.size/dur
print('Integrate and fsolve vectorized separately: speed = {} ints/sec'.format(speed))

and the output is always something like

Integrate and fsolve vectorized in one: speed = 298.151473998 ints/sec

Integrate and fsolve vectorized seperately: speed = 2136.75134429 ints/sec

Since this is only a simplified version of my actual problem, I need to know why it behaves like this. Could someone please explain? Thanks!

Dändän
    The difference you see is impressive, but it mostly boils down to numpy not being very fast for scalars. You can add an `.item()` to make it the same speed. Maybe someone else wants to elaborate, but... – seberg Jan 21 '13 at 16:15
  • @Dändän did you check the answer below? – Saullo G. P. Castro Oct 18 '14 at 04:06
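The `.item()` suggestion in the comment above can be verified directly: `scipy.optimize.fsolve` always returns an ndarray, even for a scalar root-finding problem, and `.item()` extracts the underlying Python float. A minimal sketch (using `np.exp` inside the lambda so it accepts the array argument fsolve passes in):

```python
import numpy as np
from scipy import optimize

# fsolve returns a length-1 ndarray, even for a single scalar equation
xc = optimize.fsolve(lambda x: x + 1./(1. + np.exp(-x)) - 1.0, 0.)
print(type(xc))         # <class 'numpy.ndarray'>
print(type(xc.item()))  # <class 'float'> -- .item() extracts the Python scalar
```
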

1 Answer


In summary, the problem lies with the xc variable in the "at once" approach. It is an ndarray that, when combined with x or xi (both floats) inside math.exp(), makes the code much slower. If you set xc = float(xc) in the "at once" approach you get practically the same performance as in the "separately" approach.

A detailed description of how to find that out follows below.

Using cProfile it is easy to see where the bottlenecks are:

AT ONCE:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1001    0.002    0.000    2.040    0.002 quadpack.py:135(quad)
 1001    0.001    0.000    2.038    0.002 quadpack.py:295(_quad)
 1001    0.002    0.000    2.143    0.002 tmp.py:15(integral_with_fsolve)
231231   1.776    0.000    1.925    0.000 tmp.py:17(integrand)
470780   0.118    0.000    0.118    0.000 {math.exp}
 1001    0.112    0.000    2.037    0.002 {scipy.integrate._quadpack._qagse}

SEPARATELY:
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1001    0.001    0.000    0.340    0.000 quadpack.py:135(quad)
 1001    0.001    0.000    0.339    0.000 quadpack.py:295(_quad)
 1001    0.001    0.000    0.341    0.000 tmp.py:9(integral)
231231   0.200    0.000    0.278    0.000 tmp.py:10(integrand)
470780   0.054    0.000    0.054    0.000 {math.exp}
 1001    0.060    0.000    0.338    0.000 {scipy.integrate._quadpack._qagse}
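Tables like the ones above can be generated with Python's built-in cProfile and pstats modules. A minimal, self-contained sketch profiling the "separately" approach (here fsolve's result is reduced to a plain float with `[0]` so np.vectorize handles it cleanly; the reduced xi range is just to keep the run short):

```python
import cProfile, pstats, io
import numpy as np
from math import exp, sqrt, pi
from scipy import integrate, optimize

def fsolve_xc(xi):
    # return a plain float instead of fsolve's length-1 ndarray
    return optimize.fsolve(lambda x: x + 1./(1. + np.exp(-x)) - xi, 0.)[0]

def integral(xi, xc):
    def integrand(x, xi):
        return exp(-(x - xi + xc)**2)/(2.*sqrt(2.*pi))/(1. + exp(x + xc))
    return integrate.quad(integrand, -10., 10., args=(xi,), epsabs=0.)[0]

vec_fsolve = np.vectorize(fsolve_xc)
vec_integral = np.vectorize(integral)

xi = np.linspace(0., 2., 100)  # smaller sweep for a quick profile
xc = vec_fsolve(xi)

# profile only the integration sweep and print the hottest entries
pr = cProfile.Profile()
pr.enable()
vec_integral(xi, xc)
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())
```
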

The overall timing is:

AT ONCE:
1 loops, best of 3: 1.91 s per loop
SEPARATELY:
1 loops, best of 3: 312 ms per loop

The biggest difference is in integrand, which takes almost 7x longer to run in integral_with_fsolve. And the same happens to the numerical integration quad. Even math.exp is twice as fast in the "separately" approach.

This suggests that the types being evaluated in each approach are different. In fact, that is the point. When running "at once" you can print type(xc) to see that it is a numpy.ndarray (dtype float64), while in the "separately" approach it is just a float64 scalar. It is not a good idea to sum these types inside a math.exp(), as shown below:

xa = -0.389760856858
xc = np.array([[-0.389760856858]],dtype='float64')

timeit for i in range(1000000): exp(xc+xa)
#1 loops, best of 3: 1.96 s per loop

timeit for i in range(1000000): exp(xa+xa)
#10 loops, best of 3: 173 ms per loop

In both cases math.exp() returns a float. Using exp, sqrt and pi from numpy reduces the difference, but it makes your code much slower overall, probably because these functions may also return an ndarray:

AT ONCE:
1 loops, best of 3: 4.46 s per loop
SEPARATELY:
1 loops, best of 3: 2.14 s per loop
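The return-type difference behind these timings can be checked directly: math.exp always returns a plain Python float, while numpy's functions return numpy types even for scalar input (a minimal sketch):

```python
import math
import numpy as np

x = 0.5
print(type(math.exp(x)))            # <class 'float'> -- plain Python float
print(type(np.exp(x)))              # <class 'numpy.float64'> -- numpy scalar
print(type(np.exp(np.array([x]))))  # <class 'numpy.ndarray'>
```
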

So it seems a good idea NOT to convert to an ndarray in this case. Instead, convert xc to a float, as shown below (only necessary in the "at once" approach):

def integral_with_fsolve(xi):
    xc = optimize.fsolve(lambda x: x+1./(1+exp(-x))-xi,0.)
    xc = float(xc) # <-- SEE HERE
    def integrand(x,xi):
        return exp(-(x-xi+xc)**2)/(2.*sqrt(2.*pi))/(1.+exp(x+xc))
    integral = integrate.quad(integrand,-10.,10.,args=(xi,),epsabs=0.)
    return integral[0]

with the new timing:

AT ONCE:
1 loops, best of 3: 321 ms per loop
SEPARATELY:
1 loops, best of 3: 315 ms per loop
Saullo G. P. Castro