I am seeing that in Pystan, an HDI function can be used to provide a 95% credible interval surrounding the posterior distribution. However, they say it will only work for unimodal distributions. If my model may have a multimodal distribution (up to 4 peaks), is there a way I can find the HDI in Pystan? Thanks!
1 Answer
I wouldn't consider this a Stan/PyStan-specific issue. The Highest Density Interval is by definition a single interval and is therefore inappropriate for characterizing multimodal distributions. There is a nice paper by Rob Hyndman, Computing and Graphing Highest Density Regions, that extends the concept to multimodal distributions, and it has been implemented in R in the hdrcde package.
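If you just want the basic idea rather than the full hdrcde machinery, a highest density region can be approximated directly from posterior samples by thresholding a kernel density estimate. Here is a minimal sketch of that approach (my own illustration, not part of hdrcde or PyStan; the function name `hdr_from_samples`, the grid size, and the padding are arbitrary choices):
import numpy as np
from scipy.stats import gaussian_kde

def hdr_from_samples(samples, alpha=0.05, grid_size=2000):
    """Approximate the (1 - alpha) highest density region of 1-D samples.

    Returns a list of (lo, hi) tuples; a multimodal density can yield
    several disjoint intervals.
    """
    kde = gaussian_kde(samples)
    lo, hi = samples.min(), samples.max()
    pad = 0.1 * (hi - lo)
    grid = np.linspace(lo - pad, hi + pad, grid_size)
    dens = kde(grid)

    # Find the density cutoff such that the mass above it is (1 - alpha):
    # sort grid densities from highest to lowest and accumulate normalized mass.
    order = np.argsort(dens)[::-1]
    mass = np.cumsum(dens[order]) / dens.sum()
    cutoff = dens[order][np.searchsorted(mass, 1 - alpha)]

    # Collect contiguous runs of grid points whose density exceeds the cutoff.
    intervals = []
    start = None
    for point, flag in zip(grid, dens >= cutoff):
        if flag and start is None:
            start = point
        elif not flag and start is not None:
            intervals.append((start, point))
            start = None
    if start is not None:
        intervals.append((start, grid[-1]))
    return intervals
The hdrcde package and the function discussed below handle bandwidth selection and edge cases more carefully, so treat this only as an illustration of the concept.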
As for Python, there's a discussion of this on the PyMC Discourse site, where it is recommended to use a function (`hpd_grid`) that Osvaldo Martin wrote for his "Bayesian Analysis with Python" book. The source for that function is in the hpd.py file, and for a 95% region it would be used like
from hpd import hpd_grid
intervals, x, y, modes = hpd_grid(samples, alpha=0.05)
where `samples` are the posterior samples of one of your parameters, and `intervals` is a list of tuples representing the regions of highest density.
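For instance, continuing the call above, the regions can be reported alongside their modes (this assumes `hpd_grid` returns one mode per interval, as it does in the version of hpd.py I've seen):
for (lo, hi), mode in zip(intervals, modes):
    print(f"mode {mode}: 95% HPD region [{lo:.2f}, {hi:.2f}]")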
Example with Plot
Here's an example plot using some fake multimodal data.
import numpy as np
from matplotlib import pyplot as plt
from hpd import hpd_grid
# include two modes
samples = np.random.normal(loc=[-4,4], size=(1000, 2)).flatten()
# compute high density regions
hpd_mu, x_mu, y_mu, modes_mu = hpd_grid(samples)
plt.figure(figsize=(8,6))
# raw data
plt.hist(samples, density=True, bins=29, alpha=0.5)
# estimated distribution
plt.plot(x_mu, y_mu)
# high density intervals
for (x0, x1) in hpd_mu:
    plt.hlines(y=0, xmin=x0, xmax=x1, linewidth=5)
    plt.axvline(x=x0, color='grey', linestyle='--', linewidth=1)
    plt.axvline(x=x1, color='grey', linestyle='--', linewidth=1)
# modes
for xm in modes_mu:
    plt.axvline(x=xm, color='r')
plt.show()
Cautionary Note
Note that multimodal posterior distributions for properly modeled parameters are fairly rare, but they show up frequently in non-converged MCMC sampling, especially when multiple chains are used (which is best practice). If one expects multimodality a priori, that usually leads to some form of mixture model, which would eliminate the multimodality. If one does not expect multimodality, but the posteriors exhibit it anyway, that is a red flag and a reason to be skeptical of the results.
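As a rough illustration of how to check for that, you can look at the split R-hat diagnostic and trace plots with ArviZ (here `fit` is assumed to be the result of `StanModel.sampling()` and "theta" is a placeholder parameter name):
import arviz as az

# Convert the PyStan fit to ArviZ's InferenceData and summarize the chains.
# R-hat values noticeably above 1.01 suggest the chains have not mixed,
# which can show up as spurious multimodality in the pooled samples.
idata = az.from_pystan(posterior=fit)
print(az.summary(idata, var_names=["theta"]))

# Trace plots make non-mixing chains easy to spot visually.
az.plot_trace(idata, var_names=["theta"])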

- Hi. Will this work if I am using the `fit.optimizing()` function in Pystan to obtain my posterior? – nik jacks Dec 11 '18 at 14:37
- @nikjacks `pystan.StanModel.optimizing()` only generates a MAP point estimate of the parameters, so no. Use `pystan.StanModel.sampling()` to obtain samples from your posterior. – merv Dec 11 '18 at 16:50
- Thanks! So would it make sense to use the optimizing function to generate the parameters that maximize the posterior for my result, and then use the `hpd_grid()` function with `StanModel.sampling()` to generate the parameters that will give a confidence interval around the result? – nik jacks Dec 11 '18 at 17:15
- @nikjacks I wouldn't use `optimizing` at all; in fact, I only ever use it in diagnostic stages of model development. It only returns a single value per parameter, and since you expect multiple modes, that value could change depending on your random initialization. Moreover, MAP estimates (while better than MLEs) often lie outside the typical set of high-dimensional distributions and therefore are not representative. The `modes` list returned from `hpd_grid` is likely what you're after, and the `intervals` are the bounds around them. (A sketch of this sampling-based workflow is shown below, after the comments.) – merv Dec 11 '18 at 18:05
- Is there code or a gist to do the same for `scipy` distributions? Namely for the `rv_continuous` class in `scipy`. – Mark Jan 04 '23 at 19:14
- @Mark this code is agnostic to where the samples come from - it will work for any NumPy array of numbers. – merv Jan 05 '23 at 02:15
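To make the workflow discussed in the comments concrete, here is a rough sketch using PyStan 2 (the model code, data, and parameter name "mu" are placeholders):
import pystan
from hpd import hpd_grid

# Sample from the posterior instead of optimizing, so we get full draws.
sm = pystan.StanModel(model_code=model_code)  # model_code holds your Stan program
fit = sm.sampling(data=data, iter=2000, chains=4)

# Extract the draws for one parameter and pass them to hpd_grid.
mu_samples = fit.extract()["mu"]
intervals, x, y, modes = hpd_grid(mu_samples, alpha=0.05)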