5

Suppose I have an array: adata=array([0.5, 1.,2.,3.,6.,10.]) and I want to calculate log likelihood of Weibull distribution of this array, given the parameters [5.,1.5] and [5.1,1.6]. I have never thought I need to write my own Weibull probability density functions for this task, as it is already provided in scipy.stat.distributions. So, this ought to do it:

from scipy import stats
from numpy import *
adata=array([0.5, 1.,2.,3.,6.,10.])
def wb2LL(p, x): #log-likelihood of 2 parameter Weibull distribution
    return sum(log(stats.weibull_min.pdf(x, p[1], 0., p[0])), axis=1)

And:

>>> wb2LL(array([[5.,1.5],[5.1,1.6]]).T[...,newaxis], adata)
array([-14.43743911, -14.68835298])

Or I reinvent the wheel and write a new Weibull pdf function, such as:

wbp=lambda p, x: p[1]/p[0]*((x/p[0])**(p[1]-1))*exp(-((x/p[0])**p[1]))
def wb2LL1(p, x): #log-likelihood of 2 paramter Weibull
    return sum(log(wbp(p,x)), axis=1)

And:

>>> wb2LL1(array([[5.,1.5],[5.1,1.6]]).T[...,newaxis], adata)
array([-14.43743911, -14.68835298])

Admittedly, I always take it for granted that if something is already in scipy, it should be very well optimized and re-inventing the wheel is seldom going to make it faster. But here comes the surprise: if I timeit, 100000 calls of wb2LL1(array([[5.,1.5],[5.1,1.6]])[...,newaxis], adata) takes 2.156s while 100000 calls of wb2LL(array([[5.,1.5],[5.1,1.6]])[...,newaxis], adata) takes 5.219s, the build-in stats.weibull_min.pdf() is more than twice slower.

Checking the source code python_path/sitepackage/scipy/stat/distributions.py didn't provides an easy answer, at least for me. If anything, from it I would expect stats.weibull_min.pdf() to be almost as fast as wbp().

Relevant source code: line 2999-3033:

class frechet_r_gen(rv_continuous):
    """A Frechet right (or Weibull minimum) continuous random variable.

    %(before_notes)s

    See Also
    --------
    weibull_min : The same distribution as `frechet_r`.
    frechet_l, weibull_max

    Notes
    -----
    The probability density function for `frechet_r` is::

        frechet_r.pdf(x, c) = c * x**(c-1) * exp(-x**c)

    for ``x > 0``, ``c > 0``.

    %(example)s

    """
    def _pdf(self, x, c):
        return c*pow(x,c-1)*exp(-pow(x,c))
    def _logpdf(self, x, c):
        return log(c) + (c-1)*log(x) - pow(x,c)
    def _cdf(self, x, c):
        return -expm1(-pow(x,c))
    def _ppf(self, q, c):
        return pow(-log1p(-q),1.0/c)
    def _munp(self, n, c):
        return special.gamma(1.0+n*1.0/c)
    def _entropy(self, c):
        return -_EULER / c - log(c) + _EULER + 1
frechet_r = frechet_r_gen(a=0.0, name='frechet_r', shapes='c')
weibull_min = frechet_r_gen(a=0.0, name='weibull_min', shapes='c')

So, the question is: is stats.weibull_min.pdf() really slower? If so, how come?

B--rian
  • 5,578
  • 10
  • 38
  • 89
CT Zhu
  • 52,648
  • 17
  • 120
  • 133
  • 1
    I might have some numerical stability concerns with that hand-written one. Compensating for that might eat some time in the built-in function. – user2357112 Aug 25 '13 at 17:30
  • 1
    I feel the same, kinda lose faith on scipy/python. The library function wasn't as advanced as what I expected. – colinfang Mar 01 '14 at 14:33
  • 1
    @colinfang, In `scipy`, `%timeit ss.weibull_min.pdf(0.1*np.arange(1,11),5.,scale=1)` reports `1000 loops, best of 3: 595 µs per loop`. In `R`, `benchmark(pweibull((1:10)*0.1, 5 ,1), replications=10^6)` reports 22.83 s in total. As a programming language, `python` is arguably much better than `R`. But many times, I do also wish the `scipy` library function can be faster. – CT Zhu Mar 02 '14 at 02:53

1 Answers1

2

The pdf() method is defined in rv_continuous class, which calls frechet_r_gen._pdf(). the code of pdf() is:

def pdf(self,x,*args,**kwds):
    loc,scale=map(kwds.get,['loc','scale'])
    args, loc, scale = self._fix_loc_scale(args, loc, scale)
    x,loc,scale = map(asarray,(x,loc,scale))
    args = tuple(map(asarray,args))
    x = asarray((x-loc)*1.0/scale)
    cond0 = self._argcheck(*args) & (scale > 0)
    cond1 = (scale > 0) & (x >= self.a) & (x <= self.b)
    cond = cond0 & cond1
    output = zeros(shape(cond),'d')
    putmask(output,(1-cond0)+np.isnan(x),self.badvalue)
    if any(cond):
        goodargs = argsreduce(cond, *((x,)+args+(scale,)))
        scale, goodargs = goodargs[-1], goodargs[:-1]
        place(output,cond,self._pdf(*goodargs) / scale)
    if output.ndim == 0:
        return output[()]
    return output

So, it has many argument processing code, which make it slow.

HYRY
  • 94,853
  • 25
  • 187
  • 187
  • thanks! Never thought these will cause it becoming twice as slow. Now it makes sense that those lines of argument check eat up so much time coz they are executed every `.pdf()` call. If I have a large optimization problem that requires a lot of `.pdf()` calls and I known in advance that my parameter space is well confined inside the supported region, I should ditch the build-in `.pdf()` for better performance. Right? – CT Zhu Aug 26 '13 at 03:12