28

I need to calculate binomial confidence intervals for large set of data within a script of python. Do you know any function or library of python that can do this?

Ideally I would like to have a function like this http://statpages.org/confint.html implemented on python.

Thanks for your time.

Geparada
  • 2,898
  • 9
  • 31
  • 43
  • 4
    Did you look at Scipy, statsmodels, Pandas? (Just suggestions, I don't know if any of these actually has what you want.) – Fred Foo Oct 24 '12 at 22:51
  • 1
    Would this: http://math.stackexchange.com/questions/27518/how-to-calculate-a-confidence-interval-for-a-binomial-given-a-specific-prior help? – favoretti Oct 24 '12 at 22:51
  • @favoretti I found this post before, I'm sure that there are many ways to do it with R, but fist I want to know if there is any way to do it with python. – Geparada Oct 24 '12 at 22:56
  • This [paper](http://arxiv.org/abs/1012.0566) might also be helpful... – Andy Hayden Oct 24 '12 at 23:05
  • @hayden I already have it, thanks! – Geparada Oct 25 '12 at 00:35
  • In Python, you want to use statsmodels. The functions you are looking for are mostly available here: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html – cgnorthcutt Oct 23 '19 at 17:21

9 Answers9

52

Just noting because it hasn't been posted elsewhere here that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.

Danica
  • 28,423
  • 6
  • 90
  • 122
  • 4
    assuming you have `k` successes out of `n` for significance level `alpha` then the confidence interval is given by: `proportion_confint(k, n, method='binom_test', alpha=0.05)` – Matt Sep 15 '20 at 16:43
  • @Matt `proportion_confint(3, 10, method='binom_test', alpha=0.05)` causes my Python kernel to restart :/ – endolith Jul 12 '22 at 21:53
11

I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from javascript.

def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0

    while(k<=N):
            tot += v
            if(k >= x1 and k <= x2):
                    s += v
            if(tot > 10**30):
                    s = s/10**30
                    tot = tot/10**30
                    v = v/10**30
            k += 1
            v = v*q*(N+1-k)/k
    return s/tot

def calcBin(vx, vN, vCL = 95):
    '''
    Calculate the exact confidence interval for a binomial proportion

    Usage:
    >>> calcBin(13,100)    
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)   
    (0.18405151367187494, 0.9010086059570312)
    ''' 
    vx = float(vx)
    vN = float(vN)
    #Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU

    vP = vx/vN
    if(vx==0):
            dl = 0.0
    else:
            v = vP/2
            vsL = 0
            vsH = vP
            p = vTL/100

            while((vsH-vsL) > 10**-5):
                    if(binP(vN, v, vx, vN) > p):
                            vsH = v
                            v = (vsL+v)/2
                    else:
                            vsL = v
                            v = (v+vsH)/2
            dl = v

    if(vx==vN):
            ul = 1.0
    else:
            v = (1+vP)/2
            vsL =vP
            vsH = 1
            p = vTU/100
            while((vsH-vsL) > 10**-5):
                    if(binP(vN, v, 0, vx) < p):
                            vsH = v
                            v = (vsL+v)/2
                    else:
                            vsL = v
                            v = (v+vsH)/2
            ul = v
    return (dl, ul)
Curt
  • 1,394
  • 9
  • 16
  • Thanks you so much... I still don't find any python library to do this, so I'll use this code or R. THANKS! – Geparada Oct 25 '12 at 16:32
  • @Kurtis: which Javascript library did you start with that offers this fine functionality? – Ahmed Fasih May 29 '13 at 12:48
  • I got it from the javascript code on the page in the question (i.e. http://statpages.org/confint.html). It isn't a whole library, just a function on that page – Curt Aug 03 '13 at 16:35
  • You can use this Python package: https://www.statsmodels.org/devel/generated/statsmodels.stats.proportion.proportion_confint.html – cgnorthcutt Oct 23 '19 at 17:21
3

While the scipy.stats module has a method .interval() to compute the equal tails confidence, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.

This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.

import numpy
from scipy.stats import beta
from scipy.stats import norm

def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Function computes the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.

    Parameters
    ----------
    n: number of successes 
    N: sample size 
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)

    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)

    """
    # fixed random variable object for posterior Beta distribution
    rv = beta(n+a, N-n+b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n+a-1.)/(N+a+b-2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
    # set the min and max values for success probability 
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
    min_p = mode - n_sigma * stdev
    if min_p > 1:
        min_p = 1.
    # make the range of success probabilities
    p_range = numpy.linspace(min_p, max_p, n_pbins+1)
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval 
    sorted_idxs = numpy.argsort( pmf )[::-1]
    cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
    j = numpy.argmin( numpy.abs(cumsum - pct) )
    upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
    lower = p_range[ (sorted_idxs[:j+1]).min() ]    

    return (mode, lower, upper)
mtw729
  • 163
  • 1
  • 5
  • 2
    If I run `binomial_confidence(0,0,0.5,a=2,b=2)` (i.e. [this](http://www.wolframalpha.com/input/?i=beta%282%2C2%29+distribution) distribution) it should output a symmetric interval around `0.5`, but it outputs `(0.0, 0.447)`. – Amelio Vazquez-Reina Mar 07 '14 at 13:56
  • For those parameters, it also outputs the mode on `0`, which is also wrong. The mode should be `0.5`. – Amelio Vazquez-Reina Mar 07 '14 at 13:58
  • 2
    Thanks for the bug-report. The problem steps from the calculation of the mode, which assumes at least 1 of the variables (n,N,a,b) is a float. I realize now I shouldn't make that assumption and the offending line has been changed accordingly. `binomial_confidence(0, 0, 0.5, a=2, b=2)` now returns `(0.5, 0.32648112494601628, 0.67351887505398356)` – mtw729 Mar 09 '14 at 05:28
  • 2
    Thanks @mtw729 - To clarify, is the code that you wrote computing the **High Posterior Density Region** OR the **Central Credible Interval**? See the definitions [here](http://stackoverflow.com/questions/22284502/high-posterior-density-region-and-credible-intervals). – Amelio Vazquez-Reina Mar 09 '14 at 16:09
  • 2
    This is code calculates the HPD region (as noted in the description at the top of my answer). Apologies that the function name makes this less than clear. – mtw729 Mar 10 '14 at 03:33
  • numpy 1.19.5 complains that linspace's 3rd arg (num) needs to be an integer, and n_pbins+1 is a float. I was able to fix this by changing the default n_pbins from 1e3 to 1000. – Near Privman Mar 22 '21 at 09:08
  • If you're using priors, isn't this a Bayesian approach -- and therefore yields a *credible* interval, rather than a *confidence* interval, as mentioned in the question? – Denziloe Jan 03 '22 at 21:00
3

Just been trying this myself. If it helps here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval, I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta. I.e. mle = number of successes/number of trials. I find the upper bound of the one sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.

from scipy.stats import binom
from scipy.optimize import bisect

def binomial_ci( mle, N, alpha=0.05 ):
    """
    One sided confidence interval for a binomial test.

    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that

    P(k/N < mle; theta = c) = alpha

    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial. 
    """


    to_minimise = lambda c: binom.cdf(mle*N,N,c)-alpha
    return bisect(to_minimise,0,1)

To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.

James Thorniley
  • 126
  • 1
  • 5
  • 1
    Interesting! I get slightly weird results with your method though. For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Klopper-Pearson. Your method gives (.38, .63). The upper bound is the same to the 10th decimal place, but the lower bound is very different. Any idea why? – cxrodgers Oct 01 '15 at 02:04
3

The following gives exact (Clopper-Pearson) interval for binomial distribution in a simple way.

def binomial_ci(x, n, alpha=0.05):
    #x is number of successes, n is number of trials
    from scipy import stats
    if x==0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1-alpha, x,n-x+1)[0]
    if x==n:
        c2=1
    else:
        c2 = stats.beta.interval(1-alpha, x+1,n-x)[1]
    return c1, c2

You may check the code by e.g.:

p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5

I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval

Dani
  • 139
  • 1
3

I am not an expert on statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:

from scipy.stats import binomtest

binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)

binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)

It uses Clopper-Pearson exact method by default, which matches Curt's accepted answer, which gives these values, for comparison:

    Usage:
    >>> calcBin(13,100)    
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)   
    (0.18405151367187494, 0.9010086059570312)

It also has options for Wilson's method, with or without continuity correction, which matches TheBamf's astropy answer:

binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)

binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])

This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:

For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Klopper-Pearson.

binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)
endolith
  • 25,479
  • 34
  • 128
  • 192
2

I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.

The docstring explains most of it. It assumes you have scipy installed.

def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion 
    where there is a wide range in the sample size or the proportion.

    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are desctibed by a beta distribution.

    Parameters
    ----------

    x: the number of cases from which the proportion is calulated as a positive integer.

    N: the sample size as a positive integer.

    alpha : set at 0.95 for 95% confidence intervals.

    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.

    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100,2)

    intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
    intervals.insert(0,p)

    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}

    return result
John
  • 41,131
  • 31
  • 82
  • 106
  • 1
    If you do exact_CI(7, 7) for example the Upper CI is less than 100% which doesn't make sense. – Aaron Silverman Nov 14 '13 at 02:30
  • It makes sense: just because 7 trials out of 7 gave you success doesn't mean you're 100% confident that the probability of success is 100%. Remember that `exact_CI(7, 7)` returns a 95% confidence interval. – quant_dev Jan 30 '18 at 23:43
  • @John Why did you use Beta(1, 1) as the prior and not Beta(1/2, 1/2) (Jeffrey's prior)? – quant_dev Jan 30 '18 at 23:59
2

A numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the normal cumulative density function,

import math

def binconf(p, n, c=0.95):
  '''
  Calculate binomial confidence interval based on the number of positive and
  negative events observed.

  Parameters
  ----------
  p: int
      number of positive events observed
  n: int
      number of negative events observed
  c : optional, [0,1]
      confidence percentage. e.g. 0.95 means 95% confident the probability of
      success lies between the 2 returned values

  Returns
  -------
  theta_low  : float
      lower bound on confidence interval
  theta_high : float
      upper bound on confidence interval
  '''
  p, n = float(p), float(n)
  N    = p + n

  if N == 0.0: return (0.0, 1.0)

  p = p / N
  z = normcdfi(1 - 0.5 * (1-c))

  a1 = 1.0 / (1.0 + z * z / N)
  a2 = p + z * z / (2 * N)
  a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))

  return (a1 * (a2 - a3), a1 * (a2 + a3))


def erfi(x):
  """Approximation to inverse error function"""
  a  = 0.147  # MAGIC!!!
  a1 = math.log(1 - x * x)
  a2 = (
    2.0 / (math.pi * a)
    + a1 / 2.0
  )

  return (
    sign(x) *
    math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
  )


def sign(x):
  if x  < 0: return -1
  if x == 0: return  0
  if x  > 0: return  1


def normcdfi(p, mu=0.0, sigma2=1.0):
  """Inverse CDF of normal distribution"""
  if mu == 0.0 and sigma2 == 1.0:
    return math.sqrt(2) * erfi(2 * p - 1)
  else:
    return mu + math.sqrt(sigma2) * normcdfi(p)
duckworthd
  • 14,679
  • 16
  • 53
  • 68
  • Great solution! No-dependencies and implements the very useful Wilson score. The approximation here seems to be accurate to at least three decimal places in my testing--well worth the lack of dependencies. – SMX Apr 10 '18 at 20:57
1

Astropy provides such a function (although installing and importing astropy may be a bit excessive): astropy.stats.binom_conf_interval

TheBamf
  • 597
  • 5
  • 8