I have searched around and to my surprise it seems that this question has not been answered.
I have a Numpy array containing 10000 values from measurements. I have plotted a histogram with Matplotlib, and by visual inspection the values seem to be normally distributed:
However, I would like to validate this. I have found a normality test implemented under scipy.stats.mstats.normaltest, but the result says otherwise. I get this output:
(masked_array(data = [1472.8855375088663],
mask = [False],
fill_value = 1e+20)
, masked_array(data = [ 0.],
mask = False,
fill_value = 1e+20)
)
which means that the chances that the dataset is normally distributed are 0. I have re-run the experiments and tested them again obtaining the same outcome, and in the "best" case the p value was 3.0e-290.
I have tested the function with the following code and it seems to do what I want:
import numpy
import scipy.stats as stats
mu, sigma = 0, 0.1
s = numpy.random.normal(mu, sigma, 10000)
print stats.normaltest(s)
(1.0491016699730547, 0.59182113002186942)
If I have understood and used the function correctly it means that the values are not normally distributed. (And honestly I have no idea why there is a difference in the output, i.e. less details.)
I was pretty sure that it is a normal distribution (although my knowledge of statistics is basic), and I don't know what could the alternative be. How can I check what is the probability distribution function in question?
EDIT:
My Numpy array containing 10000 values is generated like this (I know that's not the best way to populate a Numpy array), and afterwards the normaltest is run:
values = numpy.empty(shape=10000, 1))
for i in range(0, 10000):
values[i] = measurement(...) # The function returns a float
print normaltest(values)
EDIT 2:
I have just realised that the discrepancy between the outputs is because I have inadvertently used two different functions (scipy.stats.normaltest() and scipy.stats.mstats.normaltest()), but it does not make a difference since the relevant part of the output is the same regardless of the used function.
EDIT 3:
Fitting the histogram with the suggestion from askewchan:
plt.plot(bin_edges, scipy.stats.norm.pdf(bin_edges, loc=values.mean(), scale=values.std()))
results in this:
EDIT 4:
Fitting the histogram with the suggestion from user user333700:
scipy.stats.t.fit(data)
results in this: