How can I create a continuous distribution of a dataset?

Question

I wish to create a continuous probability distribution from this dataset.

The 'Value' shows a measured value and the 'Weight' is the probability of measuring this value in this measurement.

I have already plotted the data, with the value on the x-axis and the probability on the y-axis, but I would like to fit an exact distribution to this data.

In my data analysis I eventually want to compare several data distributions by their parameters. I hope you can help me out.

Line #	Value	Weight
0	0.0538502	0.016508
1	0.0184823	0.0298487
2	0.0647929	0.0122637
3	0.0262852	0.0234716
4	0.0447611	0.0197072
5	0.0643164	0.0165399
6	0.0709176	0.0143751
7	0.0871276	0.012253
8	0.0341064	0.0197392
9	0.0593696	0.0143858
10	0.0436119	0.0202617
11	0.0505131	0.0191846
12	0.0378706	0.0207842
13	0.0298233	0.0250712
14	0.157727	0.0111866
15	0.0556603	0.0186408
16	0.0542849	0.017617
17	0.0395772	0.0180969
18	0.0694962	0.0117305
19	0.0343318	0.0229277
20	0.139291	0.00907511
22	0.0232517	0.0186514
23	0.207768	0.0069423
24	0.0156452	0.021872
25	0.117749	0.0100989
26	0.124017	0.0111973
27	0.0679313	0.0133407
28	0.0733413	0.0117198
29	0.100553	0.0133407
30	0.0695865	0.016508
31	0.117732	0.0138633
32	0.0540577	0.0170518
33	0.0736274	0.0170625
34	0.0332381	0.0293155
35	0.0803423	0.0159961
36	0.0465	0.0191846
37	0.0889299	0.0159854
38	0.053232	0.020251
39	0.131361	0.0122637
40	0.0233194	0.0240048
41	0.830735	0.0053107
42	0.341012	0.0069423
43	0.101263	0.0106534
44	0.127061	0.00959765
45	0.13706	0.0122637
46	0.120035	0.0106641
47	0.0801194	0.0138526
48	0.0617996	0.0165186
49	0.197555	0.0117305
50	0.0810635	0.0133301
51	0.0178539	0.0335811
52	0.0391433	0.0170518
53	0.0663863	0.0133194
54	0.0617675	0.0170625
55	0.00684359	0.0346582
56	0.0642299	0.0133301
57	0.00970105	0.0239941
58	0.0307687	0.0213068
59	0.0160796	0.0255937
60	0.0147901	0.0266388
61	0.073745	0.0122637
62	0.0420728	0.0207949
63	0.0211625	0.0207949
66	0.0241562	0.0255937
67	0.0329688	0.0239834
68	0.0739628	0.0181289
69	0.0149927	0.0266388
70	0.0130271	0.0378467
73	0.0107957	0.0351914
74	0.040447	0.0175744
75	0.00123215	0.0559756
76	0.0134575	0.0309151
77	0.00592594	0.0453116

Can you be more specific as to what you mean by "creating a distribution"? — Sebastian Baltser, Apr 09 '21 at 17:25
I do not yet know specifically what distribution will best fit the data. Do you know if there is a way to test this? I'm not very familiar with working with distributions. — yungdurum, Apr 09 '21 at 17:37
You should link to other SO questions you looked up first and explain why they didn't answer your question (i.e. why your question is unique). — ThatNewGuy, Apr 09 '21 at 17:38
Please don't paste images of code or data, and please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — Pierre D, Apr 09 '21 at 17:57
What are "slope of the segment" and "probability of slope value" referring to? How were they calculated? — Pierre D, Apr 09 '21 at 18:01
@ThatNewGuy Are you actually advocating that OP link to all negative search results with explanations as to the lack of relevance? They certainly should do due diligence searching for solutions, but what you seem to be asking isn't really feasible, nor is it a requirement for SO. — pjs, Apr 09 '21 at 19:04
What are the weights, and how have they been determined? Are they estimates of the density? They can't be probabilities, since you've stated this is data from a continuous distribution. — pjs, Apr 09 '21 at 19:06
I changed the explanation of the problem. Hopefully it's clearer now. — yungdurum, Apr 09 '21 at 19:18
What you're calling the 'probability weight' can not be a probability if this is a continuous distribution since P{X = x} = 0 for all x for continuous distributions, so I still don't know what that column means. How was it constructed? — pjs, Apr 09 '21 at 19:23
@pjs The "values" are related to the state of an enzyme. I calculated the weights by dividing the time in the corresponding state by the total time (the sum of the corresponding times). Yes, they are estimates of the densities. — yungdurum, Apr 09 '21 at 19:25
So I basically want to translate this dataset to a probability distribution. — yungdurum, Apr 09 '21 at 19:32
Are you asking for a curve fit here? Are you looking for a polynomial that fits your data? — Tim Roberts, Apr 09 '21 at 19:42
No, preferably not, although I think it might work if I then normalize it. But I need to be able to compare this distribution to other distributions. I think the best way to do this is to identify the distribution and then compare the parameters. — yungdurum, Apr 09 '21 at 19:49
@pjs not all, just one or two would be fine. The original question was very vague and could easily have been interpreted as a duplicate. The point is to give context for the question. — ThatNewGuy, Apr 09 '21 at 20:48
@yungdurum I hate to sound like a pest, but the weights are ***not*** probabilities because they sum up to ~1.391. This would also seem to contradict your comment where you said they were derived by dividing the time-in-state by the total time. — pjs, Apr 09 '21 at 21:56
@pjs Thanks! :) You actually identified an error in my code. — yungdurum, Apr 10 '21 at 14:10

Pierre D · Answer 1 · 2021-04-10T01:25:56.940

It looks like the data you have is a sort of (non-normalized) histogram.
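If the weights are read as density estimates at irregularly spaced values (as the comments under the question suggest), one way to normalize them is to rescale so that the area under the curve is 1. The following is only a minimal sketch of that idea: it uses the first eight rows of the table, and the helper `trapezoid_area` is a hypothetical name introduced for illustration.

```python
import numpy as np

# First eight (Value, Weight) rows from the question's table.
values = np.array([0.0538502, 0.0184823, 0.0647929, 0.0262852,
                   0.0447611, 0.0643164, 0.0709176, 0.0871276])
weights = np.array([0.016508, 0.0298487, 0.0122637, 0.0234716,
                    0.0197072, 0.0165399, 0.0143751, 0.012253])

# Sort by value so the curve runs left to right.
order = np.argsort(values)
v, w = values[order], weights[order]


def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x, y)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


# Rescale the weights so the curve integrates to 1 over [v.min(), v.max()].
density = w / trapezoid_area(v, w)
print("area after normalization:", trapezoid_area(v, density))
```

This treats the weights as heights of a curve rather than as probability masses; if they are masses instead, dividing by `w.sum()` is the appropriate normalization.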

The first task is of course to plot it:

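import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# df is assumed to hold the table from the question (columns 'Value' and 'Weight')
df = df.sort_values('Value')
plt.plot(df['Value'], df['Weight'])
plt.xlabel('value')
plt.ylabel('weight')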

At first glance, it could indicate an exponential or a power-law distribution, but let's see.

Let's first try to smooth out that curve:

import statsmodels.api as sm

x, w = df['Value'].values, df['Weight'].values
s = pd.DataFrame(sm.nonparametric.lowess(w, x, frac=0.2), columns=['x', 'w']).set_index('x').squeeze()
s = s.reindex(np.linspace(x.min(), x.max(), 200), method='ffill', limit=1).interpolate()
s.plot()
plt.plot(x, w, '.')

That gives an okay-ish fit:

We can then use that to generate a fake, crude "sample" following that smooth pdf:

sample = np.random.choice(s.index, p=s/s.sum(), size=1000)
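For comparison, the same resampling idea can be tried without the smoothing step, drawing directly from the original values with the normalized weights as probabilities. This is a sketch under that assumption, using the first eight rows of the table and a seeded `numpy` generator so the draw is reproducible; it is not a replacement for the smoothed version above.

```python
import numpy as np

# First eight (Value, Weight) rows from the question's table.
values = np.array([0.0538502, 0.0184823, 0.0647929, 0.0262852,
                   0.0447611, 0.0643164, 0.0709176, 0.0871276])
weights = np.array([0.016508, 0.0298487, 0.0122637, 0.0234716,
                    0.0197072, 0.0165399, 0.0143751, 0.012253])

# Probabilities passed to choice() must sum to 1.
probs = weights / weights.sum()

# Seeded generator so repeated runs give the same draw.
rng = np.random.default_rng(seed=0)
resampled = rng.choice(values, size=5000, p=probs)

# The resample's mean should be close to the weighted mean of the data.
weighted_mean = np.sum(probs * values)
print(f"weighted mean = {weighted_mean:.4f}, resample mean = {resampled.mean():.4f}")
```

Because this only ever returns the measured values, the result is a discrete sample; the smoothing step is what fills in the gaps between measurements.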

At that point, you can make QQ plots with various distributions following your intuition, and select one that seems to fit well:

# scipy.stats exposes the same distributions publicly; this avoids the private _continuous_distns module
from scipy import stats as distns

# trying a normal (the default)
sm.qqplot(sample, line='q')
plt.title('Normal')

Clearly not a good fit at all (but we knew that from a first glance at the data):

# trying an exponential
sm.qqplot(sample, distns.expon, line='q')
plt.title('Exponential')

Not very good either:

Perhaps a power-law would fit better?

# we are only interested in the parameter a, so we are
# not going to let loc and scale be fitted;
# instead, we will freeze them at loc=0, scale=1
a, loc, scale = distns.powerlaw.fit(sample, floc=0, fscale=1)

# then, we do the QQ plot with the fitted parameter a
sm.qqplot(sample, distns.powerlaw, distargs=(a,), line='q')
plt.title(f'Power law with a={a}')
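The `floc` and `fscale` keywords used above freeze a parameter at a given value during the fit instead of estimating it. The effect is easiest to see on the exponential distribution, where the maximum-likelihood estimates have a closed form. The data here is synthetic and only serves as a placeholder for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder data (not the data from the question).
rng = np.random.default_rng(1)
data = rng.exponential(scale=0.05, size=2000)

# Both parameters fitted: loc ends up at the smallest observation.
loc_free, scale_free = stats.expon.fit(data)

# loc frozen at 0: only the scale is estimated, and it equals the sample mean.
loc_fixed, scale_fixed = stats.expon.fit(data, floc=0)

print(f"free : loc={loc_free:.5f}, scale={scale_free:.5f}")
print(f"fixed: loc={loc_fixed:.5f}, scale={scale_fixed:.5f}")
```

Freezing parameters matters when the goal is to compare datasets by their parameters: with fewer free parameters, the remaining ones are easier to interpret side by side.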

Corresponding distribution and how to use it

You can now instantiate a distribution following what was found (type and parameters), draw random variates from it, and also plot the pdf directly for comparison purposes with the original data:

g = distns.powerlaw(a=a)

# new points drawn according to g
v = g.rvs(size=100000)
plt.hist(v, bins=100, density=True, histtype='step');

Direct pdf plot and comparison with the original data:

y = g.pdf(x)
plt.plot(x, y/y.sum())
plt.plot(x, w/w.sum(), '.')
plt.title('Normalized pdf and original sample data')

Last word

So, where to go from here? You should look in depth into that distribution and its physical meaning, and see if that makes sense in the context of your experimental setup.
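As a numerical complement to the QQ plots, one could fit a few candidate distributions and rank them by their Kolmogorov-Smirnov statistic (smaller means a closer match). The sketch below assumes a synthetic placeholder sample and an arbitrary candidate list (`expon`, `gamma`, `lognorm`); other distributions from `scipy.stats` could be added to the dictionary in the same way.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder; in practice this would be the resampled `sample` array.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=0.05, size=1000)

# Illustrative candidates only; the choice is an assumption, not a recommendation.
candidates = {
    "expon": stats.expon,
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
}

results = {}
for name, dist in candidates.items():
    # Fit with the location frozen at 0, then measure the distance
    # between the empirical CDF and the fitted CDF.
    params = dist.fit(sample, floc=0)
    ks = stats.kstest(sample, dist.cdf, args=params)
    results[name] = (ks.statistic, params)

# Print candidates from best to worst KS statistic.
for name, (statistic, params) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:8s} KS statistic = {statistic:.4f}")
```

Note that the p-values reported by `kstest` are too optimistic when the parameters were fitted on the same sample, so only the statistic is used here, and only for ranking.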

Wow, thanks a lot. This is my first project working with distributions in Python, so maybe this sounds stupid, but I was thinking I could also interpret the weight as the frequency of measuring that 'value' (it is actually based on it). What if I create a histogram based on those frequencies and then try to fit a distribution to it? Does this make sense? I also found this other 'solution' for finding the best distribution: https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python?lq=1 — yungdurum, Apr 11 '21 at 17:38
