61

I have a histogram

H=hist(my_data,bins=my_bin,histtype='step',color='r')

I can see that the shape is almost gaussian but I would like to fit this histogram with a gaussian function and print the value of the mean and sigma I get. Can you help me?

Brian
  • 13,996
  • 19
  • 70
  • 94
  • 2
    "fit this histogram with a gaussian function"? Usually we just compute the mean and standard deviation of the histogram directly. What do you mean by "fit this histogram with a gaussian function"? – S.Lott Oct 18 '11 at 10:13
  • how can you compute the mean and standard deviation "directly". What if the histogram is not really a gaussian and I want to fit it, let's say, with a log-normal distribution? – Brian Oct 18 '11 at 10:18
  • 2
    There are equations for the mean and standard deviation of any set of data points regardless of their distribution. And any curve (such as a straight line y = mx + b) can be fit to any set of data. You will need to read up on basic statistical functions (mean, median, mode, variance, ...) and least-squares approximation. Understand curve fitting for basic (linear and quadratic) functions first before trying it out on more complex curves. – Dave Oct 18 '11 at 10:59
  • 2
    Curve fitting is not actually required, if you've got the data. Just find the mean and the standard deviation, and plug them into the formula for the normal (aka Gaussian) distribution (http://en.wikipedia.org/wiki/Normal_distribution). – Thomas K Oct 18 '11 at 11:57
  • 1
    The mean of a histogram is `sum( value*frequency for value,frequency in h )/sum( frequency for _,frequency in h )`. The standard deviation is equally simple -- but a bit long for a comment. Can you please **update** the question to explain in more detail what you're trying to do? – S.Lott Oct 18 '11 at 12:08
  • @ThomasK: BTW, finding the mean and standard deviation is actually curve fitting for a Gaussian distribution (and is the optimal fitting mechanism in certain senses). – Mr Fooz Dec 24 '11 at 13:04

5 Answers5

83

Here you have an example working on py2.6 and py3.2:

from scipy.stats import norm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

# read data from a text file. One number per line
arch = "test/Log(2)_ACRatio.txt"
datos = []
for item in open(arch,'r'):
    item = item.strip()
    if item != '':
        try:
            datos.append(float(item))
        except ValueError:
            pass

# best fit of data
(mu, sigma) = norm.fit(datos)

# the histogram of the data
n, bins, patches = plt.hist(datos, 60, normed=1, facecolor='green', alpha=0.75)

# add a 'best fit' line
y = mlab.normpdf( bins, mu, sigma)
l = plt.plot(bins, y, 'r--', linewidth=2)

#plot
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=%.3f,\ \sigma=%.3f$' %(mu, sigma))
plt.grid(True)

plt.show()

enter image description here

joaquin
  • 82,968
  • 29
  • 138
  • 152
  • 1
    I want to do this to my dataset, without scaling, thus getting my data's sigma.. Not some scaled sigma! – Mossa Mar 08 '12 at 23:06
  • 1
    @user2820579 What do you mean with the "fitting of the height" ? This post perfectly answers the question on the OP. If it doesn't fit your particular problem, make a new question, but do not downvote a valid answer. – joaquin Nov 19 '15 at 19:28
  • Sorry, I misunderstood the `(mu, sigma) = norm.fit(datos)`. – user2820579 Nov 19 '15 at 19:45
  • Is this a Guassian fit? – J W Mar 06 '17 at 12:26
  • @JamesWarner, Yes, use data parameters to draw the line for a normal distribution – joaquin Mar 16 '17 at 10:31
  • For the Histogram plot, I would label the Y axis as density or frequency rather than probability to avoid the wrath of Statisticians. https://math.stackexchange.com/a/105456 – talekeDskobeDa Jan 21 '19 at 10:33
  • 5
    Because of deprecation issues, "scipy.stats.norm.pdf" is preferable instead of "mlab.normpdf" – Tommaso Di Noto May 16 '19 at 15:34
  • @MossaNova you can read how the norming is done in the docs (https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html). Bascially you just need to multiply the normpdf-values by `sum of all your bin-heights` * `bin-width` – N4ppeL Nov 21 '19 at 10:02
  • 1
    "AttributeError: module 'matplotlib.mlab' has no attribute 'norm'": Use `from scipy.stats import norm` and `norm.pdf()` instead – kabhel Jan 19 '22 at 12:44
30

Here is an example that uses scipy.optimize to fit a non-linear functions like a Gaussian, even when the data is in a histogram that isn't well ranged, so that a simple mean estimate would fail. An offset constant also would cause simple normal statistics to fail ( just remove p[3] and c[3] for plain gaussian data).

from pylab import *
from numpy import loadtxt
from scipy.optimize import leastsq

fitfunc  = lambda p, x: p[0]*exp(-0.5*((x-p[1])/p[2])**2)+p[3]
errfunc  = lambda p, x, y: (y - fitfunc(p, x))

filename = "gaussdata.csv"
data     = loadtxt(filename,skiprows=1,delimiter=',')
xdata    = data[:,0]
ydata    = data[:,1]

init  = [1.0, 0.5, 0.5, 0.5]

out   = leastsq( errfunc, init, args=(xdata, ydata))
c = out[0]

print "A exp[-0.5((x-mu)/sigma)^2] + k "
print "Parent Coefficients:"
print "1.000, 0.200, 0.300, 0.625"
print "Fit Coefficients:"
print c[0],c[1],abs(c[2]),c[3]

plot(xdata, fitfunc(c, xdata))
plot(xdata, ydata)

title(r'$A = %.3f\  \mu = %.3f\  \sigma = %.3f\ k = %.3f $' %(c[0],c[1],abs(c[2]),c[3]));

show()

Output:

A exp[-0.5((x-mu)/sigma)^2] + k 
Parent Coefficients:
1.000, 0.200, 0.300, 0.625
Fit Coefficients:
0.961231625289 0.197254597618 0.293989275502 0.65370344131

gaussian plot with fit

serv-inc
  • 35,772
  • 9
  • 166
  • 188
Ralph
  • 409
  • 5
  • 3
  • I am wondering why i get very different fits when using your function and the one joaquin proposes? see my related question for details.... https://stackoverflow.com/questions/44630658/difference-scipy-stats-norm-mlab-normpdf-and-fitting-gaussian – Sebastiano1991 Jun 19 '17 at 13:07
9

Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.

The NormalDist object can be built from a set of data with the NormalDist.from_samples method and provides access to its mean (NormalDist.mean) and standard deviation (NormalDist.stdev):

from statistics import NormalDist

# data = [0.7237248252340628, 0.6402731706462489, -1.0616113628912391, -1.7796451823371144, -0.1475852030122049, 0.5617952240065559, -0.6371760932160501, -0.7257277223562687, 1.699633029946764, 0.2155375969350495, -0.33371076371293323, 0.1905125348631894, -0.8175477853425216, -1.7549449090704003, -0.512427115804309, 0.9720486316086447, 0.6248742504909869, 0.7450655841312533, -0.1451632129830228, -1.0252663611514108]
norm = NormalDist.from_samples(data)
# NormalDist(mu=-0.12836704320073597, sigma=0.9240861018557649)
norm.mean
# -0.12836704320073597
norm.stdev
# 0.9240861018557649
Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190
5

Here is another solution using only matplotlib.pyplot and numpy packages. It works only for Gaussian fitting. It is based on maximum likelihood estimation and have already been mentioned in this topic. Here is the corresponding code :

# Python version : 2.7.9
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt

# For the explanation, I simulate the data :
N=1000
data = np.random.randn(N)
# But in reality, you would read data from file, for example with :
#data = np.loadtxt("data.txt")

# Empirical average and variance are computed
avg = np.mean(data)
var = np.var(data)
# From that, we know the shape of the fitted Gaussian.
pdf_x = np.linspace(np.min(data),np.max(data),100)
pdf_y = 1.0/np.sqrt(2*np.pi*var)*np.exp(-0.5*(pdf_x-avg)**2/var)

# Then we plot :
plt.figure()
plt.hist(data,30,normed=True)
plt.plot(pdf_x,pdf_y,'k--')
plt.legend(("Fit","Data"),"best")
plt.show()

and here is the output.

Community
  • 1
  • 1
Bouliech
  • 84
  • 1
  • 3
4

I was a bit puzzled that norm.fit apparently only worked with the expanded list of sampled values. I tried giving it two lists of numbers, or lists of tuples, but it only appeared to flatten everything and threat the input as individual samples. Since I already have a histogram based on millions of samples, I didn't want to expand this if I didn't have to. Thankfully, the normal distribution is trivial to calculate, so...

# histogram is [(val,count)]
from math import sqrt

def normfit(hist):
    n,s,ss = univar(hist)
    mu = s/n
    var = ss/n-mu*mu
    return (mu, sqrt(var))

def univar(hist):
    n = 0
    s = 0
    ss = 0
    for v,c in hist:
        n += c
        s += c*v
        ss += c*v*v
    return n, s, ss

I'm sure this must be provided by the libraries, but as I couldn't find it anywhere, I'm posting this here instead. Feel free to point to the correct way to do it and downvote me :-)

Ketil Malde
  • 311
  • 3
  • 6
  • 1
    Thanks! Yeah, it would be nice to see something like this, which I wrote up at [ENH: Accept run-length encoded data as input to functions like scipy.stats.describe() · Issue #15679 · scipy/scipy](https://github.com/scipy/scipy/issues/15679) – nealmcb Mar 01 '22 at 02:26