2

My use case is to evaluate Poisson pmf on all points which is less than say, 10, and I would call such function multiple of times with difference lambdas. The lambdas are not known ahead of time so I cannot vectorize lambdas.

I heard from somewhere about a secret trick which is to use _pmf. What is the downside to do so? But still, it is a bit slow, is there any way to improve it without rewriting the pmf in C from scratch?

%timeit scipy.stats.poisson.pmf(np.arange(0,10),3.3)
%timeit scipy.stats.poisson._pmf(np.arange(0,10),3.3)
a = np.arange(0,10)
%timeit scipy.stats.poisson._pmf(a,3.3)

10000 loops, best of 3: 94.5 µs per loop
100000 loops, best of 3: 15.2 µs per loop
100000 loops, best of 3: 13.7 µs per loop

Update

Ok, simply I was just too lazy to write in cython. I had expected there is a faster solution for all discrete distribution that can be evaluated sequentially (iteratively) for consecutive x. E.g. P(X=3) = P(X=2) * lambda / 3 if X ~ Pois(lambda)

Related: Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?

I have less faith in Scipy and Python now. The library function isn't as advanced as what I had expected.

Community
  • 1
  • 1
colinfang
  • 20,909
  • 19
  • 90
  • 173

3 Answers3

5

Most of scipy.stats distributions support vectorized evaluation:

>>> poisson.pmf(1, [5, 6, 7, 8])
array([ 0.03368973,  0.01487251,  0.00638317,  0.0026837 ])

This may or may not be fast enough, but you can try taking pmf calls out of the loop.

Re difference between pmf and _pmf: the real work is done in the underscored functions (_pmf, _cdf etc) while the public functions (pmf, cdf) make sure that only valid arguments make it to the _pmf (The output of _pmf is not guaranteed to be meaningful if the arguments are invalid, so use on your own risk).

>>> poisson.pmf(1, -1)
nan
>>> poisson._pmf(1, -1)
/home/br/virtualenvs/scipy-dev/local/lib/python2.7/site-packages/scipy/stats/_discrete_distns.py:432: RuntimeWarning: invalid value encountered in log
  Pk = k*log(mu)-gamln(k+1) - mu
nan

Further details: https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L2721

ev-br
  • 24,968
  • 9
  • 65
  • 78
2
  1. Try implementing the pmf in cython. If your scipy is part of a package like Anaconda or Enthought you probably have cython installed. http://cython.org/

  2. Try running it with pypy. http://pypy.org/

  3. Rent time on a big AWS server (or similar).

joel3000
  • 1,249
  • 11
  • 22
0

I found that the scipy.stats.poisson class is tragically slow compared to just a simple python implementation.

No cython or vectors or anything.

import math


def poisson_pmf(x, mu):
    return mu**x / math.factorial(x) * math.exp(-mu)


def poisson_cdf(k, mu):
    p_total = 0.0
    for x in range(k + 1):
        p_total += poisson_pmf(x, mu)
    return p_total

And if you check the source code of scipy.stats.poisson (even the underscore prefixed versions) then it is clear why!

The above implementation is now ONLY 10x slower than the exact equivalent in C (compiled with gcc -O3 v9.3). The scipy version is at least another 10x slower.

#include <math.h>

unsigned long factorial(unsigned n) {
  unsigned long fact = 1;
  for (unsigned k = 2; k <= n; ++k)
    fact *= k;
  return fact;
}

double poisson_pmf(unsigned x, double mu) {
  return pow(mu, x) / factorial(x) * exp(-mu);
}

double poisson_cdf(unsigned k, double mu) {
  double p_total = 0.0;
  for (unsigned x = 0; x <= k; ++x)
    p_total += poisson_pmf(x, mu);
  return p_total;
}

Oliver Schönrock
  • 1,038
  • 6
  • 11
  • Note that this solution is only faster for reasonably small x. When I try this with `x = 10000`, `mu = 10`, poisson_pmf takes 2.7 ms, and a frozen scipy.stats.poisson object's pmf takes 97 _µs_. – sclamons Apr 27 '22 at 20:33