
I have a bunch of data points, and I would like to fit a Normal distribution to the data. I see that scipy has the stats.norm.fit method, but this requires a single list of data points. Something like

data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]

whereas my data is contained in two lists, something like:

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

How can I fit a normal distribution to data formatted in this way?

Adam Jaamour
Daniel Standage

2 Answers


You could expand the values to the full set with numpy.repeat, and use scipy.stats.norm.fit:

In [54]: import numpy as np

In [55]: from scipy.stats import norm

In [56]: values = [1, 2, 3, 4, 5]

In [57]: counts = [4, 3, 6, 1, 3]

In [58]: full_values = np.repeat(values, counts)

In [59]: full_values
Out[59]: array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5])

In [60]: norm.fit(full_values)  # Estimate mean and std. dev.
Out[60]: (2.7647058823529411, 1.3516617991854185)

scipy.stats.norm.fit computes the maximum likelihood estimates of the parameters. For the normal distribution, these are just the sample mean and the square root of the biased sample variance (the one with denominator n rather than n - 1). As far as I know, the only relevant weighted statistical function in numpy or scipy is numpy.average, so you can do the computation yourself, passing counts as the weights argument:

In [62]: sample_mean = np.average(values, weights=counts)

In [63]: sample_mean
Out[63]: 2.7647058823529411

In [64]: sample_var = np.average((values - sample_mean)**2, weights=counts)

In [65]: sample_var
Out[65]: 1.8269896193771626

In [66]: sample_std = np.sqrt(sample_var)

In [67]: sample_std
Out[67]: 1.3516617991854185

Note that statistics.stdev is based on the unbiased sample variance. If that's what you want, you can adjust the scaling by multiplying the biased sample variance by sum(counts)/(sum(counts) - 1):

In [79]: n = sum(counts)

In [80]: sample_var = n/(n-1)*np.average((values - sample_mean)**2, weights=counts)

In [81]: sample_var
Out[81]: 1.9411764705882353

In [82]: sample_std = np.sqrt(sample_var)

In [83]: sample_std
Out[83]: 1.3932610920384718
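As a compact alternative (a sketch of the same computations, not a different method), numpy.std applied to the expanded array gives both estimates directly via its ddof argument: ddof=0 is the biased (MLE) estimate that norm.fit returns, and ddof=1 is the unbiased-variance version:

```python
import numpy as np

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

# Expand the (value, count) pairs into the full sample.
full_values = np.repeat(values, counts)

# ddof=0: divide by n -- the biased (maximum likelihood) estimate,
# the same scale that scipy.stats.norm.fit reports.
biased_std = np.std(full_values, ddof=0)    # 1.3516617991854185

# ddof=1: divide by n - 1 -- matches statistics.stdev.
unbiased_std = np.std(full_values, ddof=1)  # 1.3932610920384718
```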
Warren Weckesser

The most straightforward way of characterising a sample from a (purported) normal population is to take its mean and standard deviation. We can use the standard library's statistics module to do that, along with Alex Martelli's flatten lambda to expand the sample.

>>> values = [1, 2, 3, 4, 5]
>>> counts = [4, 3, 6, 1, 3]
>>> import statistics
>>> sample = [c*[v] for (c, v) in zip(counts, values)]
>>> sample
[[1, 1, 1, 1], [2, 2, 2], [3, 3, 3, 3, 3, 3], [4], [5, 5, 5]]
>>> flatten = lambda l: [item for sublist in l for item in sublist]
>>> sample = flatten(sample)
>>> sample
[1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]
>>> statistics.mean(sample)
2.764705882352941
>>> statistics.stdev(sample)
1.3932610920384718
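If you prefer to avoid the lambda, the same flattening can be done with itertools.chain.from_iterable from the standard library (just a stylistic variant of the steps above):

```python
import statistics
from itertools import chain

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

# Build the per-value sublists and flatten them in one pass.
sample = list(chain.from_iterable(c * [v] for c, v in zip(counts, values)))

mean = statistics.mean(sample)    # 2.764705882352941
stdev = statistics.stdev(sample)  # 1.3932610920384718 (unbiased variance)
```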
Bill Bell
    @DanielStandage With the mean and standard deviation you can then recreate the Normal Distribution with its probability density function (https://en.wikipedia.org/wiki/Normal_distribution). – KDecker Nov 22 '17 at 17:36
  • @KDecker: Exactly! With the mean and standard deviation you know *everything*. – Bill Bell Nov 22 '17 at 17:37