Calculating Highest Density Interval (HDI) for 2D array of data points

Question

I have two arrays, x and y, which determine the position and a third one, z, which provides the value of the radiation at such position.

I want to calculate and plot the HDI of 90% over the plot of radiation to be able to say that 90% of the radiation can be found in such region.

If I were to normalize z,the HDI for 2D radiation data means the smallest area in which one can integrate the normalized distribution function and get the desired level, in my case, 0.9 or 90%, being the integral between -inf and inf equal to 1.0.

This problem looks isomorphic to calculate the HDI of a probability distribution so I think that it can be done, but I just do not know how to proceed any further.

Edit to make it clearer: This snipped generates data (x,y, and z) which are similar to the topology of my own, so I am using it as a proxy:

import numpy as np 

## Spatial data, with randomness to show that it is not in a rectangular grid 
x = np.arange(-2,2, step=0.2)
x+= 0.1*np.random.rand(len(x)) 
y = np.arange(-2,2, step=0.2) 
y+= 0.1*np.random.rand(len(y)) 

z = np.zeros((len(x), len(y))) 
for i, ix in enumerate(x): 
    for k, iy in enumerate(y): 
        tmp = np.cos(ix)*np.sin(iy)
        if tmp>=0: 
            z[i,k] = tmp 
z=z/np.sum(z)

This is a crude sketch of my expectations: The data are mostly accumulated on the left, as shown by the contour plots, and the red line represents HDI of .9, where 90% of the integral of data is enclosed (the line is purposely not following one of the isolines, because that needn't to be the case)

The bold black line labelled 94% HDI is an example of what I want, but I want it in 2D:

I cannot find the way to use az.hdi, or az.plot_kde. I have tried to feed the data as observable to a PYMC model, so that the MCMC sampling could translate values to density of points (which is what az.hdi and az.plot_kde want), but I haven't been able to do it. A very crude approach would be to generate artificial points around each of my own proportional to the value of radiation at that point. I think there must be a more elegant and efficient way to solve this.

Getting the indices of the points inside the HDI could be an alternative solution, because calculating the convex hull I could still get the HDI.

Any idea about how to proceed is welcome.

@Reinderien I put an example of a data with similar topology in my example code, which generates x, y and z. I would be very interested to know if that is suitable, as I have never used Seaborn myself and thus I am not aware of the possibilities or how to use them. Thank you. — Iván Paradela Pérez, Jul 08 '23 at 06:42
Have you read https://stackoverflow.com/questions/22284502/highest-posterior-density-region-and-central-credible-region ? — Reinderien, Jul 09 '23 at 03:34
Yes, I did read that, before posting my question. My data are not distributed according to a normal, beta, student-t, etc. It is not even guarantee to be unimodal. Therefore, I do not have a CDF readily available. The problem could be indeed solved if I could calculate the CDF, but that is the point of my question. I have thought about using my data as input for something like PYMC and get a CDF, but I do not know enough PYMC. The second problem is that all the answers in that thread are for 1D distributions. When the data are 2D, even for unimodal distributions it is much harder to calculate. — Iván Paradela Pérez, Jul 09 '23 at 04:08
Another way to solve the problem would be to somehow get the CDF of a gaussian process regression in 2D and getting a CDF out it. I am sure that it can be easily done in PYMC, but again I do not have the knowledge of PYMC to achieve it. I only know how to work it on 1D data and I do not know how to get the CDF out of the resulting object. — Iván Paradela Pérez, Jul 09 '23 at 04:15
I would be grateful if you could construct an alternative data set to demonstrate your method. If it works, I can then figure out how to apply it to my real data. Thank you. — Iván Paradela Pérez, Jul 09 '23 at 18:27
It took some work, but I've landed on a solution that I'm reasonably confident is correct. Kernel density estimation is not necessary with this approach, and the iso-contour "looks like" the profile of the original data, as it should. — Reinderien, Jul 10 '23 at 04:02

Reinderien · Accepted Answer · 2023-07-10T03:53:34.473

I demonstrate a method using pure scipy/numpy/matplotlib.

The tricky part about this function is that it's a "density function" but not a "probability density function", so the usual histogram binning etc. does not apply. You need to do a linear tesselation and then a double integral. The resultant density is a function of domain area.

import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate
from numpy.random import default_rng

# Spatial data, with randomness to show that it is not in a rectangular grid
rand = default_rng(seed=0)
n = 20
y, x = np.linspace(-2, 2, num=n) + rand.uniform(-0.05, 0.05, size=(2, n))
radiation = np.clip(
    np.sin(y)[:, np.newaxis] * np.cos(x)[np.newaxis, :],
    a_min=0, a_max=None,
)

# Since the actual grid is known to be irregular, interpolate to a regular grid
# This is necessary for a non-biased double integral
# Irregular edge data need to be discarded for pure interpolation (or else extrapolation is needed)
bound = min(x.max(), y.max(), -x.min(), -y.min())
xy_irregular = np.stack(np.meshgrid(x, y), axis=-1).reshape((-1, 2))
x_regular = np.linspace(-bound, bound, n)
y_regular = np.linspace(-bound, bound, n)
xy_regular = np.stack(
    np.meshgrid(x_regular, y_regular), axis=-1,
).reshape((-1, 2))
rad_interpolated = scipy.interpolate.griddata(
    values=radiation.ravel(),
    points=xy_irregular, xi=xy_regular,
    # Linear tesselation: cannot use cubic or else negatives will be introduced
    method='linear',
)

# Effectively double integral over dxdy
density = np.sort(rad_interpolated)
cdf = (density / density.sum()).cumsum()

# The highest-density interval is equivalent to an inverse empirical (non-normal) CDF evaluated at
# the lower-bound quantile.
include_quantile = 0.6
exclude_quantile = 1 - include_quantile
idensity = cdf.searchsorted(exclude_quantile)
qdensity = density[idensity]

ax_left: plt.Axes
ax_right: plt.Axes
fig, (ax_left, ax_right) = plt.subplots(ncols=2)

ax_left.contour(x, y, radiation, levels=[qdensity])
ax_left.set_title(f'Radiation bound at hdi({include_quantile})={qdensity:.3f}')

area = np.linspace(0, (2*bound)**2, len(cdf))
ax_right.plot(area, cdf)
ax_right.set_title('Radiation, cumulative distribution')
ax_right.set_xlabel('Function area')
ax_right.set_ylabel('Density')
ax_right.axhline(exclude_quantile, c='red')
ax_right.axvline(area[idensity], c='red')

plt.show()

Thank you very much for your contribution, but I do not think that this is the solution for my problem, unless I am not understanding the plot. If you were to plot the contour(x,y,radiation.reshape(20,20).T), it shows that the cumulative 90% should be not centered, but somewhere on the left side, probably between one of the iso lines. Sadly I cannot post pictures of this in the comments, as far as I know, but with the contour plot suggested above it is pretty clear. I have also tried to play with the bw_adjust and the thresh kwargs to no avail. — Iván Paradela Pérez, Jul 08 '23 at 21:09
I added an image that is a crude sketch of my expectation for the solution of the problem. — Iván Paradela Pérez, Jul 08 '23 at 21:25
I feel like the mathematical definition is given in the title. I am looking for the HDI, the highest density interval, which for 2D data means the smallest area in which one can integrate the normalized distribution function and get the desired level, in my case, 0.9 or 90%, being the integral between -inf and inf equal to 1.0. It is not about the isolines, like what contour does. — Iván Paradela Pérez, Jul 09 '23 at 03:12
I am just trying to use more layman terms and to be more loosy-goosy with my explanation in case other people have encountered a similar problem, but they didn't know how it was called. Some people also now it as hdp, highest density posterior, or bci, bayesian credible interval. I was trying to use arviz.hdi for that reason. — Iván Paradela Pérez, Jul 09 '23 at 03:18
I have edited the original question trying to be more precise and less confusing, sorry for that. — Iván Paradela Pérez, Jul 09 '23 at 03:29
I still have to try if this works for my real case, but it looks so much better than my solution, so I will accept your answer for now. Thank you for your kind work. — Iván Paradela Pérez, Jul 11 '23 at 00:57

Iván Paradela Pérez · Answer 2 · 2023-07-09T20:51:10.597

I have solved my problem. It is not the most precise way, nor the most elegant, so if someone else comes up with a better solution, I will change my accepted answer.

The idea is to transform values into points. For that, we provide a resolution and then we repeat the following cycle for the range of such resolution:

Find the highest value.
Duplicate the amount of points at that coordinate.
Divide the value by half.

Then we can use arviz.plot_kde, which plots density ranges, with the hdi_probs kwarg to get the desired result.

Here is my code:

import matplotlib.pyplot as plt
import numpy as np
import arviz as az
from numpy.random import default_rng

rand = default_rng(seed=0)
n = 20
y, x = np.linspace(-2, 2, num=n) + rand.uniform(-0.05, 0.05, size=(2, n))
radiation = np.clip(
    np.sin(y)[:, np.newaxis] * np.cos(x)[np.newaxis, :],
    a_min=0, a_max=None,
).ravel()

yy =  np.repeat(y, n)
xx = np.tile(x, n)


resolution = 1000
points = np.ones_like(radiation, dtype=int)

tmp= radiation.copy()

for i in range(resolution):
    ind = np.argmax(tmp)
    tmp[ind] /=2.0
    points[ind] *=2
points[points==1] = 0

fig, ax =  plt.subplots()
ax.scatter(xx,yy,points)
plt.draw()

rx = []
ry = []

for i, npt in enumerate(points):
    rx += [xx[i]]*npt
    ry += [yy[i]]*npt

rx = np.array(rx)
ry = np.array(ry)


fig, ax =  plt.subplots()
az.plot_kde(rx,ry,hdi_probs=[0.393], ax=ax)
plt.draw()

fig, ax =  plt.subplots()
az.plot_kde(rx,ry,hdi_probs=[0.393, 0.865, 0.989], ax=ax)
plt.draw()


fig, ax =  plt.subplots()
ax.contourf(x, y, radiation.reshape(20,20))
az.plot_kde(rx,ry,hdi_probs=[0.60], ax=ax,
    contourf_kwargs={'alpha':0},
    contour_kwargs={'linewidths':2.0, 'colors':'r'})
plt.draw()

plt.show()

Here are the images:

Finally, this is the image that I was looking for (I changed my mind to 60% instead of 90% after seeing the results):

The next step would be to experiment with giving each point a very narrow 2D Gaussian distribution so that the new points are assigned according to that instead of given exactly the same position. That might help the HDI algorithms, or not.

I hope this can help someone in the future with the same problem!

I don't particularly trust the "wobbles" from that kernel density estimate. — Reinderien, Jul 10 '23 at 03:59

Calculating Highest Density Interval (HDI) for 2D array of data points

2 Answers2