-2

I'm quite new to the Python world, and I'm not a statistician. I need to implement mathematical models, developed by mathematicians, in a programming language, and I've chosen Python after some research. I'm comfortable with programming as such (PHP/HTML/JavaScript).

I have a column of values that I've extracted from a MySQL database, and I need to calculate the following:

1) Normal distribution of it. (I don't have the sigma & mu values; apparently these need to be calculated too.)
2) Mixture of normal distributions
3) Estimated density of the normal distribution
4) 'Z' score

The array of values looks similar to the one below (I've populated sample data):

d1 = [3,3,3,3,3,3,3,9,12,6,3,3,3,3,9,21,3,12,3,6,3,30,12,6,3,3,24,30,3,3,3]


The normal distribution, I understand, can be fitted as below:

import numpy as np
from scipy.stats import norm

# norm.fit returns the fitted mean and standard deviation of the data
mu1, std1 = norm.fit(d1)

Could I please get some pointers on how to get started with (2), (3) & (4)? I'm continuing to look things up online while I look forward to hearing from experts.

If the question doesn't fully make sense, please do let me know what aspect is missing so that I'll try & get information around that.

I'd very much appreciate any help here please.

usert4jju7

2 Answers

1

Some parts of your question are unclear. It might help to give the context of what you're trying to achieve, rather than only the specific steps you're taking.

1) + 3) For a Normal distribution, fitting the distribution and estimating the mean and standard deviation are basically the same thing: the mean and standard deviation completely determine the distribution.

mu, std = norm.fit(data)

is tantamount to saying "find the mean and standard deviation which best fit the distribution".
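As a minimal sketch of points 1) and 3), assuming the sample data `d1` from the question: `norm.fit` returns maximum-likelihood estimates, which for a Normal distribution are the sample mean and the standard deviation computed with `ddof=0`, and `norm.pdf` then evaluates the fitted density. The names `xs` and `density` are illustrative only.

```python
import numpy as np
from scipy.stats import norm

# Sample data copied from the question
d1 = [3, 3, 3, 3, 3, 3, 3, 9, 12, 6, 3, 3, 3, 3, 9, 21, 3, 12, 3, 6,
      3, 30, 12, 6, 3, 3, 24, 30, 3, 3, 3]

# Fit a single Normal distribution (maximum-likelihood estimates)
mu, std = norm.fit(d1)

# Evaluate the fitted density on a grid spanning the data range
xs = np.linspace(min(d1), max(d1), 50)
density = norm.pdf(xs, loc=mu, scale=std)

print("mu =", mu, "std =", std)
```

Whether a single Normal is a sensible model for data this skewed is a separate question; the sketch only shows the mechanics.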

4) Calculating the Z score - you'll have to explain what you're trying to do. This usually means how much above (or below) the mean a data point is, in units of standard deviation. Is this what you need here? If so, then it is simply

(np.array(data) - mu) / std
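A self-contained version of that expression might look like the sketch below. It assumes the sample data from the question and uses the population standard deviation (`ddof=0`), which is what `norm.fit` estimates; the variable names are illustrative.

```python
import numpy as np

# Sample data copied from the question
data = [3, 3, 3, 3, 3, 3, 3, 9, 12, 6, 3, 3, 3, 3, 9, 21, 3, 12, 3, 6,
        3, 30, 12, 6, 3, 3, 24, 30, 3, 3, 3]

arr = np.asarray(data, dtype=float)
mu = arr.mean()
std = arr.std()  # ddof=0, consistent with norm.fit

# Z score: distance of each point from the mean, in units of standard deviation
z = (arr - mu) / std
```

By construction the Z scores have mean 0 and standard deviation 1. `scipy.stats.zscore(arr)` computes the same thing with its default `ddof=0`.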

2) Mixture of normal distribution - this is completely unclear. It usually means that the distribution is actually generated by more than a single Normal distribution. What do you mean by this?

Ami Tavory
  • Thank you Ami. I'll have to get back to the mathematics folks & get clarity. Kinda stuck between the maths & computers world .. phew!! :D – usert4jju7 Feb 28 '16 at 19:48
  • Hello Ami - While I wait to discuss with maths folks, I thought I'll update the question with my understanding. For a mixture distribution, as you suggested rightly that there may be several normal distributions, I've updated the question with several normal distributions. Would this now help calculate mixture distribution ? :-) – usert4jju7 Feb 28 '16 at 20:47
  • @usert4jju7 I don't quite understand the update. A mixture distribution is a single distribution that is composed from a number of underlying ones. Your update uses multiple distributions - I just don't see where the mixture comes in. Sorry - I just don't get it. – Ami Tavory Feb 29 '16 at 05:43
  • Thank you Ami. I'll get clarity around this one. Looks like what I've got on hand is pretty confusing. I'll clear this one out, & then come back to you. Thank you very much. – usert4jju7 Feb 29 '16 at 06:32
  • Hello Ami - I had a discussion further to better understand the requirements. There are many steps in the model to be designed. I've created a new question here with better clarity - `http://stackoverflow.com/questions/35740095/python-generate-multivariate-mixture-t-distribution`. Would you be happy to share your expertise & help with me please? – usert4jju7 Mar 02 '16 at 06:35
1

About (2), a web search for "mixture of Gaussians Python" should turn up a lot of hits.

The mixture of Gaussians is a pretty simple idea -- instead of a single Gaussian bump, the density contains multiple bumps. The density is a weighted sum $\sum_k \alpha_k g(x, \mu_k, \sigma_k^2)$ where the weights $\alpha_k$ are positive and sum to 1, and $g(x, \mu, \sigma^2)$ is a single Gaussian bump.
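To make the weighted sum concrete, here is a small sketch that evaluates such a density with NumPy. The function names `gaussian_pdf` and `mixture_pdf` and the parameter values are made up for illustration; they are not fitted to the data in the question.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density g(x, mu, sigma^2) of a single Gaussian bump."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def mixture_pdf(x, alphas, mus, sigma2s):
    """Weighted sum of Gaussian bumps: sum_k alpha_k * g(x, mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)
    return sum(a * gaussian_pdf(x, m, s2)
               for a, m, s2 in zip(alphas, mus, sigma2s))

# Illustrative parameters: two bumps, weights positive and summing to 1
alphas = [0.7, 0.3]
mus = [3.0, 20.0]
sigma2s = [1.0, 25.0]

xs = np.linspace(-30.0, 80.0, 20001)
density = mixture_pdf(xs, alphas, mus, sigma2s)

# Numerical check that the mixture integrates to about 1 (simple Riemann sum)
area = density.sum() * (xs[1] - xs[0])
```

Because the weights sum to 1 and each bump integrates to 1, the mixture is itself a valid density.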

To determine the parameters $\alpha_k$, $\mu_k$, and $\sigma_k^2$, typically one uses the so-called expectation-maximization (EM) algorithm. Again a web search should find many hits. The EM algorithm for a Gaussian mixture is implemented in some Python libraries. It is not too complicated to write it yourself, but maybe to get started you can use an existing implementation.
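If you do want to see what writing it yourself involves, the sketch below is a bare-bones EM loop for a one-dimensional Gaussian mixture. It is an illustration under simplifying assumptions (fixed number of iterations, quantile-based initialisation, a variance floor `min_var`), and the function name `fit_gmm_1d` and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a single Gaussian with mean mu and variance sigma2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def fit_gmm_1d(x, k=2, n_iter=200, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with a plain EM loop."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Initialisation (an assumption): equal weights, means at spread-out
    # quantiles, and every variance set to the overall data variance
    alphas = np.full(k, 1.0 / k)
    mus = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma2s = np.full(k, max(x.var(), min_var))
    for _ in range(n_iter):
        # E-step: responsibility of component j for point i, shape (n, k)
        dens = alphas * gaussian_pdf(x[:, None], mus, sigma2s)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = r.sum(axis=0) + 1e-12  # guard against an empty component
        alphas = nk / n
        mus = (r * x[:, None]).sum(axis=0) / nk
        sigma2s = np.maximum((r * (x[:, None] - mus) ** 2).sum(axis=0) / nk,
                             min_var)
    return alphas, mus, sigma2s

# Synthetic data drawn from two known Gaussians, to check the fit
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, 300),
                         rng.normal(6.0, 1.5, 200)])
alphas, mus, sigma2s = fit_gmm_1d(sample, k=2)
```

For real work an existing implementation such as `sklearn.mixture.GaussianMixture` in scikit-learn is likely the better starting point, since it handles initialisation, convergence checks and multivariate data.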

Robert Dodier