
I have numerous tuples (par1, par2), i.e. points in a two-dimensional parameter space, obtained from repeating an experiment multiple times.

I'm looking for a way to calculate and visualize confidence ellipses (not sure if that's the correct term for this). Here is an example plot that I found on the web to show what I mean:

[example plot: two clusters of points, each overlaid with confidence ellipses]

source: blogspot.ch/2011/07/classification-and-discrimination-with.html

So, in principle, I guess one has to fit a multivariate normal distribution to a 2D histogram of the data points. Can somebody help me with this?

Raphael Roth
    What's the input data? Is it an array of 2d points? Do you know in advance that there are 2 clusters? – Daniel Sep 06 '12 at 19:21
  • Yes, I know the number of clusters. I don't yet know the format of the input data; I guess an nx2 array, where n is the number of points. – Raphael Roth Sep 07 '12 at 05:06
  • In that case you should cluster them first, then fit a gaussian to each cluster and finally plot the confidence intervals. Look at sklearn.cluster – Daniel Sep 07 '12 at 17:01
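The cluster-then-fit pipeline suggested in the comments can be sketched as follows; note this uses scipy's `kmeans2` rather than sklearn.cluster, and the two-cluster data are made up for illustration:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Two made-up clusters of (par1, par2) points, stacked into an nx2 array
a = rng.multivariate_normal((0, 0), [[1.0, 0.3], [0.3, 0.5]], size=200)
b = rng.multivariate_normal((5, 4), [[0.8, -0.2], [-0.2, 1.2]], size=200)
points = np.vstack([a, b])

# Step 1: cluster (the number of clusters is known in advance)
centroids, labels = kmeans2(points, 2, minit='++')

# Step 2: fit a Gaussian to each cluster -- its mean and covariance
# are exactly what the ellipse-plotting code in the answers consumes
for k in range(2):
    cluster = points[labels == k]
    mean = cluster.mean(axis=0)
    cov = np.cov(cluster, rowvar=False)
```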

4 Answers


It sounds like you just want the 2-sigma ellipse of the scatter of points?

If so, consider something like this (from some code for a paper, here: https://github.com/joferkington/oost_paper_code/blob/master/error_ellipse.py):

import numpy as np

import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def plot_point_cov(points, nstd=2, ax=None, **kwargs):
    """
    Plots an `nstd` sigma ellipse based on the mean and covariance of a point
    "cloud" (points, an Nx2 array).

    Parameters
    ----------
        points : An Nx2 array of the data points.
        nstd : The radius of the ellipse in numbers of standard deviations.
            Defaults to 2 standard deviations.
        ax : The axis that the ellipse will be plotted on. Defaults to the 
            current axis.
        Additional keyword arguments are passed on to the ellipse patch.

    Returns
    -------
        A matplotlib ellipse artist
    """
    pos = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    return plot_cov_ellipse(cov, pos, nstd, ax, **kwargs)

def plot_cov_ellipse(cov, pos, nstd=2, ax=None, **kwargs):
    """
    Plots an `nstd` sigma error ellipse based on the specified covariance
    matrix (`cov`). Additional keyword arguments are passed on to the 
    ellipse patch artist.

    Parameters
    ----------
        cov : The 2x2 covariance matrix to base the ellipse on
        pos : The location of the center of the ellipse. Expects a 2-element
            sequence of [x0, y0].
        nstd : The radius of the ellipse in numbers of standard deviations.
            Defaults to 2 standard deviations.
        ax : The axis that the ellipse will be plotted on. Defaults to the 
            current axis.
        Additional keyword arguments are passed on to the ellipse patch.

    Returns
    -------
        A matplotlib ellipse artist
    """
    def eigsorted(cov):
        vals, vecs = np.linalg.eigh(cov)
        order = vals.argsort()[::-1]
        return vals[order], vecs[:,order]

    if ax is None:
        ax = plt.gca()

    vals, vecs = eigsorted(cov)
    theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))

    # Width and height are "full" widths, not radius
    width, height = 2 * nstd * np.sqrt(vals)
    ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwargs)

    ax.add_artist(ellip)
    return ellip

if __name__ == '__main__':
    #-- Example usage -----------------------
    # Generate some random, correlated data (note that the covariance
    # matrix must be positive semi-definite)
    points = np.random.multivariate_normal(
            mean=(1, 1), cov=[[4, 3], [3, 10]], size=1000
            )
    # Plot the raw points...
    x, y = points.T
    plt.plot(x, y, 'ro')

    # Plot a transparent 3 standard deviation covariance ellipse
    plot_point_cov(points, nstd=3, alpha=0.5, color='green')

    plt.show()

[resulting plot: red scatter of points with a transparent green 3-sigma ellipse]

Joe Kington
  • nice, thanks for the answer. I hope I got this right: Assuming a multivariate normal distribution, one can simply take the eigenvalues and the eigenvectors to calculate the ellipses. – Raphael Roth Sep 10 '12 at 13:15
  • unfortunately, matplotlib patches cannot be drawn with logarithmic axes (or at least not correctly) as I need to .... why is life so complicated? – Raphael Roth Sep 10 '12 at 13:51
  • Yeah, I never thought to test it on logarithmic axes. One work-around would be to use a PathPatch, which will draw correctly on logarithmic axes. You'd have to generate points along the ellipse manually, but that's not too hard. – Joe Kington Sep 11 '12 at 00:15
  • @RaphaelRoth Another possibility to use logarithmic scale would be to fake it by transforming the datapoints and using a tick formatter for the axes (doesn't sound easy, but could be a way) – heltonbiker Oct 31 '12 at 13:02
  • @JoeKington maybe you are still around here and can react to this comment: I used parts of your code to plot such ellipses, and it works nicely. I understand why I get orthogonal eigenvectors to find the ellipse-axis orientation; it is the same as in any principal component analysis. But I tried to understand why the semi-axis lengths would scale with the square root of the corresponding eigenvalues of the covariance matrix. Do you have any idea? I also asked it here: http://math.stackexchange.com/questions/911792/3-sigma-ellipse-why-axis-length-scales-with-square-root-of-eigenvalues-of-covar – Nras Aug 28 '14 at 11:10
  • @Nras - In a nutshell, it's because the covariance matrix gives you variances, and confidence limits are typically expressed in terms of standard deviations (e.g. 2 sigma --> 95%, etc) (Standard deviation is just the sqrt of the variance.) Because the eigenvectors returned have unit length, the eigenvalues directly correspond to the variances. (This wouldn't be true if the eigenvectors weren't unit length. We'd need to scale by their magnitude in that case.) Hopefully that helps a bit. I can add a more complete (though not mathematically rigorous) explanation, if it would help. – Joe Kington Aug 28 '14 at 14:56
  • @JoeKington Don't we need to refer to the chi-square probability distribution table to find our `nstd`, i.e. whether it be 68%, 90% or 95%? – Srivatsan Jul 05 '15 at 10:28
  • @ThePredator - If you're using it as a test, yes. (In other words, is this a different/same distribution than another one at p confidence level?) If you're simply using it as a description, then no. The confidence that you've correctly estimated the standard deviation and mean from the number of samples that you have is an entirely separate question, though. – Joe Kington Jul 05 '15 at 22:15
  • @JoeKington: One final question, could you please elaborate on the usage of `np.degrees(np.arctan2(*vecs[:,0][::-1]))` for the angle? From [this website](http://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/) I see that it is `arctan(y/x)`, but you are using `arctan2`. – Srivatsan Jul 06 '15 at 08:27
  • @ThePredator - `arctan2` returns the full angle (can be in any of the 4 quadrants). `arctan` restricts the output to quadrants 1 and 4 (between -pi/2 and pi/2). You may notice that `arctan` takes a single parameter. Therefore, it can't distinguish between angles in quadrants 1 and 4 and a similar angle in quadrants 2 and 3. This is a convention that's shared by many other programming languages, in no small part because C defines them that way. – Joe Kington Jul 06 '15 at 11:44
  • @ThePredator the correct way is implemented in another answer by Syrtis Major. I guess this answer should be edited to note that this definition of sigma does not correspond to 95%. – Zappel May 26 '17 at 14:36
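One way to implement the work-around mentioned in the comments is to sample points along the ellipse boundary and draw them as a `Polygon` patch, which (unlike an `Ellipse` patch) renders correctly on logarithmic axes. A minimal sketch (the helper name `ellipse_points` is made up; the eigendecomposition mirrors the answer above):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

def ellipse_points(cov, pos, nstd=2, n=100):
    """Return n points along the nstd-sigma ellipse boundary."""
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    t = np.linspace(0, 2 * np.pi, n)
    # scale a unit circle by the semi-axis lengths, then rotate and shift
    circle = np.column_stack([np.cos(t), np.sin(t)])
    return pos + (nstd * np.sqrt(vals) * circle) @ vecs.T

# e.g.: ax.add_patch(Polygon(ellipse_points(cov, pos), closed=True, alpha=0.5))
```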

Refer to the post How to draw a covariance error ellipse.

Here's the Python implementation:

import numpy as np
from scipy.stats import norm, chi2

def cov_ellipse(cov, q=None, nsig=None, **kwargs):
    """
    Parameters
    ----------
    cov : (2, 2) array
        Covariance matrix.
    q : float, optional
        Confidence level, should be in (0, 1)
    nsig : int, optional
        Confidence level in units of standard deviations.
        E.g. 1 stands for 68.3% and 2 stands for 95.4%.

    Returns
    -------
    width, height, rotation :
         The lengths of the two axes and the rotation angle in degrees
         for the ellipse.
    """

    if q is not None:
        q = np.asarray(q)
    elif nsig is not None:
        q = 2 * norm.cdf(nsig) - 1
    else:
        raise ValueError('One of `q` and `nsig` should be specified.')
    r2 = chi2.ppf(q, 2)

    val, vec = np.linalg.eigh(cov)
    width, height = 2 * np.sqrt(val[:, None] * r2)
    rotation = np.degrees(np.arctan2(*vec[::-1, 0]))

    return width, height, rotation
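To turn the returned width/height/rotation into a plot, something like the following self-contained sketch could work (it inlines the same chi-square scaling instead of calling `cov_ellipse`, and the data are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
from scipy.stats import chi2

rng = np.random.default_rng(0)
points = rng.multivariate_normal(mean=(1, 1), cov=[[1.0, 0.6], [0.6, 2.0]], size=500)

cov = np.cov(points, rowvar=False)
val, vec = np.linalg.eigh(cov)
r2 = chi2.ppf(0.95, df=2)               # squared Mahalanobis radius for 95%
width, height = 2 * np.sqrt(val * r2)   # full axis lengths, as in cov_ellipse
rotation = np.degrees(np.arctan2(*vec[::-1, 0]))

fig, ax = plt.subplots()
ax.plot(points[:, 0], points[:, 1], '.', alpha=0.3)
ax.add_patch(Ellipse(xy=points.mean(axis=0), width=width, height=height,
                     angle=rotation, facecolor='none', edgecolor='r'))
plt.show()
```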

Note that the meaning of "standard deviation" differs from Joe Kington's answer. We usually use 1 and 2 sigma for 68% and 95% confidence levels, but the 2-sigma ellipse in his answer does not contain 95% of the probability of the total 2-D distribution. The correct way is to use a chi-square distribution to estimate the ellipse size, as shown in the post above.
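A quick numerical check of those figures with scipy:

```python
from scipy.stats import chi2, norm

# In 1-D, +/-2 sigma covers ~95.4% of a normal distribution...
print(2 * norm.cdf(2) - 1)      # ~0.954
# ...but in 2-D, the ellipse with Mahalanobis radius 2 covers only ~86.5%:
print(chi2.cdf(2**2, df=2))     # ~0.865
# a true 95% ellipse in 2-D needs squared radius chi2.ppf(0.95, 2):
print(chi2.ppf(0.95, df=2))     # ~5.99, i.e. radius ~2.45 "sigma"
```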

Syrtis Major
  • The ellipse displayed in his answer isn't a 2-sigma ellipse. It's a 3-sigma ellipse, and it does contain about as many points as you'd expect from a 3-sigma ellipse. – senderle Jun 27 '18 at 12:25
  • I believe the difference arises because this answer describes a [confidence ellipse](https://en.wikipedia.org/wiki/Confidence_interval) while Joe's answer describes an N-sigma error ellipse. The difference between these two is explained [here](https://ftp.ngs.noaa.gov/PUBS_LIB/AlgorithmsForConfidenceCirclesAndEllipses_TR_NOS107_CGS3.pdf). – Gabriel Aug 08 '18 at 17:48
  • Updating my comment above since the link to the PDF is now dead. This is the [new link](https://www.ngs.noaa.gov/PUBS_LIB/AlgorithmsForConfidenceCirclesAndEllipses_TR_NOS107_CGS3.pdf). – Gabriel Jun 08 '22 at 13:12

I slightly modified one of the examples above, which plots the error or confidence region contours. Now I think it gives the right contours.

It was giving the wrong contours because it applied the scoreatpercentile method to the joint dataset (blue + red points), when it should be applied separately to each dataset.

The modified code can be found below:

import numpy
import scipy
import scipy.stats
import matplotlib.pyplot as plt

# generate two normally distributed 2d arrays
x1=numpy.random.multivariate_normal((100,420),[[120,80],[80,80]],400)
x2=numpy.random.multivariate_normal((140,340),[[90,-70],[-70,80]],400)

# fit a KDE to the data
pdf1=scipy.stats.gaussian_kde(x1.T)
pdf2=scipy.stats.gaussian_kde(x2.T)

# create a grid over which we can evaluate pdf
q,w=numpy.meshgrid(range(50,200,10), range(300,500,10))
r1=pdf1([q.flatten(),w.flatten()])
r2=pdf2([q.flatten(),w.flatten()])

# sample each pdf and find the density value at its 5th percentile;
# the contour at that level encloses roughly 95% of the probability mass
s1=scipy.stats.scoreatpercentile(pdf1(pdf1.resample(1000)), 5)
s2=scipy.stats.scoreatpercentile(pdf2(pdf2.resample(1000)), 5)

# reshape back to 2d
r1.shape=(20,15)
r2.shape=(20,15)

# plot the contour at the level that encloses ~95% of each distribution
plt.contour(range(50,200,10), range(300,500,10), r1, [s1],colors='b')
plt.contour(range(50,200,10), range(300,500,10), r2, [s2],colors='r')

# scatter plot the two normal distributions
plt.scatter(x1[:,0],x1[:,1],alpha=0.3)
plt.scatter(x2[:,0],x2[:,1],c='r',alpha=0.3)
plt.show()
Rodrigo

I guess what you are looking for is to compute confidence regions.

I don't know much about it, but as a starting point I would check the Sherpa application for Python. At least, in their SciPy 2011 talk, the authors mention that you can determine and obtain confidence regions with it (you may need a model for your data, though).

See the video and corresponding slides of the Sherpa talk.

HTH

gcalmettes