I need a little help. If I have a random sample of 30 observations with a mean of 52 and a variance of 30, how can I calculate the 95% confidence interval for the mean, both when the variance of 30 is estimated from the sample and when it is the true variance?
-
This is not a programming question; it is a basic statistics question. Try posting on the Cross Validated Stack Exchange site. – mathematician1975 Jul 23 '15 at 13:49
-
This question was already asked here: https://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data and correctly answered by https://stackoverflow.com/a/15034143/7735095 and https://stackoverflow.com/a/34474255/7735095 in the case where you estimate the variance from the data. If you assume that you know already for sure that the true variance is exactly 30 without any doubt before seeing the data, then you should use `np.mean(data) +- np.sqrt(30)*statistics.NormalDist().inv_cdf(0.975)/np.sqrt(len(data))`, where `len(data)` is the number of observations. – Jakob Dec 22 '21 at 12:29
1 Answer
Here you can combine the powers of numpy and statsmodels to get you started:
To produce normally distributed floats centred on 52 you can use numpy.random.normal, e.g. `numbers = np.random.normal(loc=52, scale=30, size=30)`,
where the parameters are:
    Parameters
    ----------
    loc : float
        Mean ("centre") of the distribution.
    scale : float
        Standard deviation (spread or "width") of the distribution.
    size : int or tuple of ints, optional
        Output shape. If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn. Default is None, in which case a
        single value is returned.
And here's a 95% confidence interval of the mean using DescrStatsW.tconfint_mean:
    import statsmodels.stats.api as sms

    # t-based confidence interval for the mean; alpha defaults to 0.05 (95%)
    conf = sms.DescrStatsW(numbers).tconfint_mean()
    conf
    # output
    # (36.27, 56.43)
EDIT - 1
That's not the whole story though... Depending on your sample size, you should use the Z score and not the t score that's used by sms.DescrStatsW(numbers).tconfint_mean() here. And I have a feeling that it's not coincidental that the rule-of-thumb threshold is 30, and that you have 30 observations in your question. Z vs t also depends on whether you know the population standard deviation or have to rely on an estimate from your sample, and those are calculated differently as well. Take a look here. If this is something you'd like me to explain and demonstrate further, I'll gladly take another look at it over the weekend.
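As a rough sketch of the two cases in the question, both intervals can be computed directly from the summary numbers (n = 30, mean = 52, variance = 30). This assumes normally distributed data and, for the estimated case, that 30 is the sample variance computed with the n - 1 denominator; the variable names are only illustrative, and scipy is used here rather than statsmodels.

```python
import math
from scipy import stats

n = 30              # number of observations
sample_mean = 52.0  # sample mean from the question
variance = 30.0     # either the estimated or the known variance
confidence = 0.95

# Standard error of the mean: sqrt(variance / n) = sqrt(30 / 30) = 1.0
std_error = math.sqrt(variance / n)
upper_tail = 1 - (1 - confidence) / 2  # 0.975 for a two-sided 95% interval

# Case 1: variance estimated from the sample -> t distribution, n - 1 degrees of freedom
t_crit = stats.t.ppf(upper_tail, df=n - 1)
t_interval = (sample_mean - t_crit * std_error, sample_mean + t_crit * std_error)

# Case 2: variance known in advance -> standard normal (Z)
z_crit = stats.norm.ppf(upper_tail)
z_interval = (sample_mean - z_crit * std_error, sample_mean + z_crit * std_error)

print(f"t interval: ({t_interval[0]:.2f}, {t_interval[1]:.2f})")  # (49.95, 54.05)
print(f"z interval: ({z_interval[0]:.2f}, {z_interval[1]:.2f})")  # (50.04, 53.96)
```

The t interval is slightly wider because the critical value (about 2.045 for 29 degrees of freedom) exceeds the normal one (about 1.960).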

-
The correct choice of `Z vs T` is independent of the number of samples. In the typical case where one estimates the standard deviation FROM THE DATA, the `t`-distribution is the correct choice (for normally distributed data). In the case where you already know the standard deviation from an external source before looking at the data, `Z` is the correct choice (for normally distributed data). – Jakob Dec 22 '21 at 11:28
-
For a large number of observations (larger than 30) the two different confidence intervals are quite similar, so if you have a very large number of observations and don't care too much about correctness and precision it might not be too catastrophic if you pick the wrong one. I would still recommend to always pick the correct choice instead of the wrong choice even if you have a large number of samples, even if you cannot see the difference with the naked eye for very large number of observations. – Jakob Dec 22 '21 at 11:32
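To illustrate how similar the two intervals become, here is a small sketch comparing the 97.5% critical values of the t and standard normal distributions for a few sample sizes. The sample sizes are arbitrary choices for illustration, and scipy is an assumption (it is not used elsewhere in this thread).

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% critical value, about 1.96

# Ratio of t to z critical values; this is also the ratio of interval
# widths when both intervals use the same standard error.
ratios = {}
for n in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)
    ratios[n] = t_crit / z_crit
    print(f"n={n:5d}  t={t_crit:.4f}  z={z_crit:.4f}  ratio={ratios[n]:.4f}")
```

The ratio stays above 1 for every finite n and shrinks towards 1 as n grows, so the t interval is always at least a little wider.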
-
`np.random.normal(loc=52, scale=30, size=30)` creates data with variance 30²=900 (corresponds to standard deviation 30). If you want to generate data with variance 30 (corresponds to standard deviation sqrt(30)), you have to call: `np.random.normal(loc=52, scale=np.sqrt(30), size=30)` – Jakob Dec 22 '21 at 11:35
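Putting the comments together, a hedged sketch of the whole computation on simulated data might look like this. The seed and variable names are arbitrary, `scipy.stats.t.interval` stands in for the statsmodels call used in the answer, and the known-variance case follows the `NormalDist` formula from the comment under the question.

```python
import numpy as np
from statistics import NormalDist
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
true_variance = 30.0

# scale is the standard deviation, so pass sqrt(variance)
data = rng.normal(loc=52, scale=np.sqrt(true_variance), size=30)
n = len(data)
mean = data.mean()

# Variance estimated from the data -> t interval with n - 1 degrees of freedom
sem = data.std(ddof=1) / np.sqrt(n)  # standard error from the sample std
t_low, t_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

# Variance known to be exactly 30 before seeing the data -> Z interval
z = NormalDist().inv_cdf(0.975)
half_width = z * np.sqrt(true_variance / n)
z_low, z_high = mean - half_width, mean + half_width

print("t interval:", (t_low, t_high))
print("z interval:", (z_low, z_high))
```

The exact numbers depend on the random draw, so no output is shown here.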