181

I have sample data which I would like to compute a confidence interval for, assuming a normal distribution.

I have found and installed the numpy and scipy packages and have gotten numpy to return a mean and standard deviation (numpy.mean(data) with data being a list). Any advice on getting a sample confidence interval would be much appreciated.

Marco Cerliani
  • 21,233
  • 3
  • 49
  • 54
Bmayer0122
  • 2,138
  • 2
  • 14
  • 7
  • 1
    you can use bootstrap: https://stackoverflow.com/a/66008548/10375049 – Marco Cerliani Feb 02 '21 at 12:54
  • what does one have to do for data that is not classification e.g. regression arbitrary real values? – Charlie Parker Dec 14 '21 at 23:42
  • 1
    answering my own question, yes, see: https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1 – Charlie Parker Dec 16 '21 at 01:30
  • answering my own comment above: I think it can be used for any data because of the following: I believe it is fine since the mean and std are calculated for general numeric data and the z_p/t_p value only takes in the confidence interval and data size, so it is independent of assumptions on the distribution of data. So yes I think this equation can be used for both classification and regression. – Charlie Parker May 09 '22 at 18:14
  • I came here to get the bounty, but your goals are so different that it will be difficult to write a question that is at the same time relevant to this question and addresses your questions. But in summary the test used for the top answer is relevant for Normally distributed data with few samples (as the number of samples grow it converges to the normal distribution itself). If you apply to a data that is not normal the confidence intervals will not be correct. – Bob May 10 '22 at 19:54

6 Answers6

246
import numpy as np
import scipy.stats

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

You can calculate like this.

petezurich
  • 9,280
  • 9
  • 43
  • 57
shasan
  • 2,645
  • 1
  • 13
  • 7
  • 45
    Careful with the "private" use of `sp.stats.t._ppf`. I'm not that comfortable with that in there without further explanation. Better to use `sp.stats.t.ppf` directly, unless you are sure you know what you are doing. On quick inspection of [the source](https://github.com/scipy/scipy/blob/v0.13.0/scipy/stats/distributions.py#L1474) there is a fair amount of code skipped with `_ppf`. Possibly benign, but also possibly an unsafe optimization attempt? – Russ Mar 12 '14 at 07:32
  • does this work for classification AND regression? (e.g. if there are negative values, arbitary magnitude) – Charlie Parker Dec 14 '21 at 23:37
  • 1
    anssering myself: yes it is since it's computing CI `mu_n +- t.val(0.95) * std_n/ sqrt(n)` see for details: https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1 – Charlie Parker Dec 16 '21 at 01:30
  • 1
    what is the type of `data`? – Charlie Parker Feb 16 '22 at 19:21
  • why are you doing ` a = 1.0 * data`? what is the type of `a`? – Charlie Parker Feb 16 '22 at 19:24
  • 1
    could you provide some example fake data for this? Why is the output of h not a scalar but is an array/list or something like that? – Charlie Parker Feb 16 '22 at 19:39
  • are these examples correct? https://stackoverflow.com/a/71148456/1601580 I'm mainly worried about the N by M case (or general tensors) – Charlie Parker Feb 16 '22 at 19:53
  • does this CI calculation assume anything special about the distribution of the data? e.g. does it assume it's for classification or Bernoulli or something? – Charlie Parker May 09 '22 at 18:03
  • I think it can be used for any data because of the following: I believe it is fine since the mean and std are calculated for general numeric data and the z_p/t_p value only takes in the confidence interval and data size, so it is independent of assumptions on the distribution of data. – Charlie Parker May 09 '22 at 18:11
  • How can I include the p-value to return on this algorithm? Assuming I have the expected mean for the list – capivarao Dec 02 '22 at 11:02
196

Here a shortened version of shasan's code, calculating the 95% confidence interval of the mean of array a:

import numpy as np, scipy.stats as st

st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))

But using StatsModels' tconfint_mean is arguably even nicer:

import statsmodels.stats.api as sms

sms.DescrStatsW(a).tconfint_mean()

The underlying assumptions for both are that the sample (array a) was drawn independently from a normal distribution with unknown standard deviation (see MathWorld or Wikipedia).

For large sample size n, the sample mean is normally distributed, and one can calculate its confidence interval using st.norm.interval() (as suggested in Jaime's comment). But the above solutions are correct also for small n, where st.norm.interval() gives confidence intervals that are too narrow (i.e., "fake confidence"). See my answer to a similar question for more details (and one of Russ's comments here).

Here an example where the correct options give (essentially) identical confidence intervals:

In [9]: a = range(10,14)

In [10]: mean_confidence_interval(a)
Out[10]: (11.5, 9.4457397432391215, 13.554260256760879)

In [11]: st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a))
Out[11]: (9.4457397432391215, 13.554260256760879)

In [12]: sms.DescrStatsW(a).tconfint_mean()
Out[12]: (9.4457397432391197, 13.55426025676088)

And finally, the incorrect result using st.norm.interval():

In [13]: st.norm.interval(0.95, loc=np.mean(a), scale=st.sem(a))
Out[13]: (10.23484868811834, 12.76515131188166)
Asclepius
  • 57,944
  • 17
  • 167
  • 143
Ulrich Stern
  • 10,761
  • 5
  • 55
  • 76
  • 1
    I believe you should be calling `st.t.interval(0.05)` to get the 95% confidence interval. – Scimonster Jun 26 '16 at 14:33
  • 6
    No, `st.t.interval(0.95)` is correct for the 95% confidence interval, see the [docs](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) for `scipy.stats.t`. SciPy's naming the argument `alpha` seems less than ideal, though. – Ulrich Stern Jun 26 '16 at 15:33
  • If I have two arrays of data and then calculated the difference of their mean. Is there any way to get a 95% CI for this mean difference? Could you think of any easy way to do it like the one you provide here by using StatsModelsl? – steven Jan 26 '19 at 20:50
  • Student-t distribution should be used when the sample size is small (less than 30), which is in this case ([10,11,12,13). As a result, normal distribution gives a different result. If you increase your sample size to 1000 for instance, t- and norm give almost identical results. – Mehdi Jul 30 '21 at 17:25
  • does this work for classification AND regression? (e.g. if there are negative values, arbitary magnitude) – Charlie Parker Dec 14 '21 at 23:37
  • 1
    Why use `scale=st.sem(a)` instead of `scale=np.std(a)`? Since scale is the standard deviation? – NoName Jul 19 '22 at 16:39
29

Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module:

from statistics import NormalDist

def confidence_interval(data, confidence=0.95):
  dist = NormalDist.from_samples(data)
  z = NormalDist().inv_cdf((1 + confidence) / 2.)
  h = dist.stdev * z / ((len(data) - 1) ** .5)
  return dist.mean - h, dist.mean + h

This:

  • Creates a NormalDist object from the data sample (NormalDist.from_samples(data), which gives us access to the sample's mean and standard deviation via NormalDist.mean and NormalDist.stdev.

  • Compute the Z-score based on the standard normal distribution (represented by NormalDist()) for the given confidence using the inverse of the cumulative distribution function (inv_cdf).

  • Produces the confidence interval based on the sample's standard deviation and mean.


This assumes the sample size is big enough (let's say more than ~100 points) in order to use the standard normal distribution rather than the student's t distribution to compute the z value.

Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
  • does this work for classification AND regression? (e.g. if there are negative values, arbitary magnitude) – Charlie Parker Dec 14 '21 at 23:38
  • 1
    Is there any reason to use the wrong but approximately correct normal distribution instead the perfectly correct t-distribution? I don't see any disadvantage of using the correct t-distribution (see https://stackoverflow.com/a/15034143/7735095 or https://stackoverflow.com/a/34474255/7735095) independent of the number of observations. – Jakob Dec 22 '21 at 12:07
18

Start with looking up the z-value for your desired confidence interval from a look-up table. The confidence interval is then mean +/- z*sigma, where sigma is the estimated standard deviation of your sample mean, given by sigma = s / sqrt(n), where s is the standard deviation computed from your sample data and n is your sample size.

bogatron
  • 18,639
  • 6
  • 53
  • 47
  • 31
    `scipy.stats.norm.interval(confidence, loc=mean, scale=sigma)` – Jaime Feb 22 '13 at 23:41
  • 4
    @bogatron, about the suggested calculus for the confidence interval, wouldn't be **mean +/- z * sigma/sqrt(n)**, where n is sample size? – David Feb 19 '15 at 00:12
  • 3
    @David, you are correct. I misstated the meaning of `sigma`. `sigma` in my answer should be the estimated standard deviation of the sample mean, not the estimated standard deviation of the distribution. I've updated the answer to clarify that. Thanks for pointing that out. – bogatron Feb 19 '15 at 14:34
  • There is a mislead in @Jaime comment. If you are computing the t student confidence interval, you don't use sigma, you use the standard error which is sigma/np.sqrt(total number of observations), otherwise you gonna get the wrong result. You could also say: scipy.stats.norm.interval(confidence, loc=mean, scale=standard error) – João Vitor Gomes May 07 '21 at 01:02
  • "looking at a look-up table" is an inappropriate answer for this stack exchange. It should be part of a library call so that code can fetch the z-score itself at runtime, and the confidence interval can be exposed to the user as a variable. – Chris Dec 06 '21 at 15:03
  • does this work for classification AND regression? (e.g. if there are negative values, arbitary magnitude) – Charlie Parker Dec 14 '21 at 23:38
  • Using z-statsitics is wrong here, instead one should use t-statistics here. For a small number of observations the confidence intervals according to your solution would be way too small! Only for very large number of observations your sollution would be an prroximation to the correct solution to this question. But I would recomment to always use the correct solution based on t-statistics: e.g. https://stackoverflow.com/a/15034143/7735095 or https://stackoverflow.com/a/34474255/7735095 – Jakob Dec 22 '21 at 12:13
3

Based on the original but with some concrete examples:

import numpy as np

def mean_confidence_interval(data, confidence: float = 0.95) -> tuple[float, np.ndarray]:
    """
    Returns (tuple of) the mean and confidence interval for given data.
    Data is a np.arrayable iterable.

    ref:
        - https://stackoverflow.com/a/15034143/1601580
        - https://github.com/WangYueFt/rfs/blob/f8c837ba93c62dd0ac68a2f4019c619aa86b8421/eval/meta_eval.py#L19
    """
    import scipy.stats
    import numpy as np

    a: np.ndarray = 1.0 * np.array(data)
    n: int = len(a)
    if n == 1:
        import logging
        logging.warning('The first dimension of your data is 1, perhaps you meant to transpose your data? or remove the'
                        'singleton dimension?')
    m, se = a.mean(), scipy.stats.sem(a)
    tp = scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    h = se * tp
    return m, h

def ci_test_float():
    import numpy as np
    # - one WRONG data set of size 1 by N
    data = np.random.randn(1, 30)  # gives an error becuase len sets n=1, so not this shape!
    m, ci = mean_confidence_interval(data)
    print('-- you should get a mean and a list of nan ci (since data is in wrong format, it thinks its 30 data sets of '
          'length 1.')
    print(m, ci)

    # right data as N by 1
    data = np.random.randn(30, 1)
    m, ci = mean_confidence_interval(data)
    print('-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single dat aset)')
    print(m, ci)

    # multiple data sets (7) of size N (=30)
    data = np.random.randn(30, 7)
    print('-- gives 7 CIs for the 7 data sets of length 30. 30 is the number ud want large if you were using z(p)'
          'due to the CLT.')
    m, ci = mean_confidence_interval(data)
    print(m, ci)

ci_test_float()

output:

-- you should get a mean and a list of nan ci (since data is in wrong format, it thinks its 30 data sets of length 1.
0.1431623130952463 [nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]
-- gives a mean and a list of length 1 for a single CI (since it thinks you have a single dat aset)
0.04947206018132864 [0.40627264]
-- gives 7 CIs for the 7 data sets of length 30. 30 is the number ud want large if you were using z(p)due to the CLT.
-0.03585104402718902 [0.31867309 0.35619134 0.34860011 0.3812853  0.44334033 0.35841138
 0.40739732]

I think the Num_samples by Num_datasets is right but if it's not let me know in the comment section.


For what type of data does it work for?

I think it can be used for any data because of the following:

I believe it is fine since the mean and std are calculated for general numeric data and the z_p/t_p value only takes in the confidence interval and data size, so it is independent of assumptions on the distribution of data.

So it can be used for regression & classification I believe.


As a bonus, a torch implementation that nearly only uses torch only:

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95
                                      ) -> Tensor:
    """
    Computes the confidence interval for a given survey of a data set.
    """
    n: int = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n ** 0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci

Some comments on CI (or see https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1):

"""
Review for confidence intervals. Confidence intervals say that the true mean is inside the estimated confidence interval
(the r.v. the user generates). In particular it says:
    Pr[mu^* \in [mu_n +- t.val(p) * std_n / sqrt(n) ] ] >= p
e.g. p = 0.95
This does not say that for a specific CI you compute the true mean is in that interval with prob 0.95. Instead it means
that if you surveyed/sampled 100 data sets D_n = {x_i}^n_{i=1} of size n (where n is ideally >=30) then for 95 of those
you'd expect to have the truee mean inside the CI compute for that current data set. Note you can never check for which
ones mu^* is in the CI since mu^* is unknown. If you knew mu^* you wouldn't need to estimate it. This analysis assumes
that the the estimator/value your estimating is the true mean using the sample mean (estimator). Since it usually uses
the t.val or z.val (second for the standardozed r.v. of a normal) then it means the approximation that mu_n ~ gaussian
must hold. This is most likely true if n >= 0. Note this is similar to statistical learning theory where we use
the MLE/ERM estimator to choose a function with delta, gamma etc reasoning. Note that if you do algebra you can also
say that the sample mean is in that interval but wrt mu^* but that is borning, no one cares since you do not know mu^*
so it's not helpful.

An example use could be for computing the CI of the loss (e.g. 0-1, CE loss, etc). The mu^* you want is the expected
risk. So x_i = loss(f(x_i), y_i) and you are computing the CI for what is the true expected risk for that specific loss
function you choose. So mu_n = emperical mean of the loss and std_n = (unbiased) estimate of the std and then you can
simply plug in the values.

Assumptions for p-CI:
    - we are making a statement that mu^* is in mu+-pCI = mu+-t_p * sig_n / sqrt n, sig_n ~ Var[x] is inside the CI
    p% of the time.
    - we are estimating mu^, a mean
    - since the quantity of interest is mu^, then the z_p value (or p-value, depending which one is the unknown), is
    computed using the normal distribution.
    - p(mu) ~ N(mu; mu_n, sig_n/ sqrt n), vial CTL which holds for sample means. Ideally n >= 30.
    - x ~ p^*(x) are iid.

Std_n vs t_p*std_n/ sqrt(n)
    - std_n = var(x) is more pessimistic but holds always. Never shrinks as n->infity
    - but if n is small then pCI might be too small and your "lying to yourself". So if you have very small data
    perhaps doing std_n for the CI is better. That holds with prob 99.9%. Hopefuly std is not too large for your
    experiments to be invalidated.

ref:
    - https://stats.stackexchange.com/questions/554332/confidence-interval-given-the-population-mean-and-standard-deviation?noredirect=1&lq=1
    - https://stackoverflow.com/questions/70356922/what-is-the-proper-way-to-compute-95-confidence-intervals-with-pytorch-for-clas
    - https://www.youtube.com/watch?v=MzvRQFYUEFU&list=PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6&index=205
"""
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
0

Regarding Ulrich's answer - that is using the t-value. We use this when the true variance is unknown. This is when the only data you have is the sample data.

For bogatron's answer, this involves z-tables. The z-tables are used when variance is already known and provided. Then you also have sample data. Sigma is not the estimated standard deviation of the sample mean. It is already known.

Let's say variance is known and we want 95% confidence:

from scipy.stats import norm
alpha = 0.95
# Define our z
ci = alpha + (1-alpha)/2
#Lower Interval, where n is sample siz
c_lb = sample_mean - norm.ppf(ci)*((sigma/(n**0.5)))
c_ub = sample_mean + norm.ppf(ci)*((sigma/(n**0.5)))

With only sample data and an unknown variance (meaning that the variance will have to be calculated solely from sample data), Ulrich's answer works perfectly. However, you probably would like to designate the confidence interval. If your data is a and you want a confidence interval of 0.95:

import statsmodels.stats.api as sms
conf = sms.DescrStatsW(a).tconfint_mean(alpha=0.05)
conf
petezurich
  • 9,280
  • 9
  • 43
  • 57