
I have the following lines of data in a file (there are of course many more lines):

data1 0.20
data2 2.32
data3 0.02
dataX x.xx

data1 1.13
data2 3.10
data3 0.96
dataX x.xx

....

I'd like to create a probability distribution for each data*. I can do that by hand, but maybe there is a library which lets me do it more automatically. Ideally I would like to avoid preformatting the lines and feed the library the lines above directly, but if that is not possible I will have to.

UPDATE

Sorry for the inaccuracy. What I want to find is how many numbers fall into custom ranges. Example:

[0.0 - 0.1) - 2 numbers;
[0.1 - 0.2) - 3 numbers;
[0.2 - 0.3) - ...

Of course I would like to be able to easily set different ranges (wider or narrower), and then, having that, I would like to generate charts.

pb100

2 Answers


The concept of 'probability' is a little subtle - if the data are the output of a stationary stochastic process, then you could estimate probabilities of future outputs of that process by measuring past outputs. But the identical dataset could have been generated deterministically, in which case there is no probability involved, and each time you run the process you'll get the identical data (instead of different data with a similar distribution).

In either case, you can get a distribution of your data by binning it into histograms. Formatting the data into separate lists can be done with:

import collections, re

data = ["data1 0.20", "data2 2.32", "data3 0.02",
        "data1 1.13", "data2 3.10", "data3 0.96" ]

# Group the values by their data* index.
hist = collections.defaultdict(list)
for d in data:
    m = re.match(r"data(\d+)\s+(\S+)", d)  # raw string avoids invalid-escape warnings
    if m:
        hist[int(m.group(1))].append(float(m.group(2)))
for k in hist.keys():
    print(k, hist[k])

producing:

1 [0.2, 1.13]
2 [2.32, 3.1]
3 [0.02, 0.96]

You can then build the histograms as described in Howto bin series of float values into histogram in Python?. Finally, normalize the bin counts so that they sum to 1.0 (divide each bin by the total across all bins) to get a probability distribution. It won't be the probability distribution used to create the data, but it will be an approximation to it.
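For example, a minimal sketch along those lines, assuming numpy is installed; hist is the dict built above, and the bin edges are just an illustration of the custom ranges from the question:

import numpy as np

values = hist[1]                    # e.g. [0.2, 1.13] from the code above
edges = np.linspace(0.0, 1.2, 13)   # custom [0.0-0.1), [0.1-0.2), ... edges
counts, edges = np.histogram(values, bins=edges)
probs = counts / counts.sum()       # normalize so the bins sum to 1.0
for lo, hi, p in zip(edges[:-1], edges[1:], probs):
    print("[%.1f - %.1f): %.2f" % (lo, hi, p))

From there something like matplotlib's plt.bar(edges[:-1], probs, width=0.1) would give a chart of the distribution.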

Dave

You could use scipy.stats.norm (and collections).

To split up your data (I think you mean to have it in this form):

import collections

# raw_data is the whole file read in as a single string
raw_data = (line.split() for line in raw_data.split('\n') if line.strip())

data = collections.defaultdict(list)
for item in raw_data:
    data[item[0]].append(float(item[1]))

data['data1'] # [0.2, 1.13, ...]

Then for each data set:

import scipy.stats

for i in range(1, X + 1):  # X = number of data* series
    scipy.stats.norm.fit(data['data' + str(i)]) # (mean, standard deviation)

scipy.stats.norm.fit(data['data1']) # (0.66499999999999992, 0.46499999999999991)

It's unclear precisely what probability you have in mind, but mean and standard deviation are a good start (you can find others among scipy's statistical functions).
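For instance, a quick sketch, assuming the data dict built above, that pulls a few more summary statistics out of scipy.stats in one call:

import scipy.stats

# describe() returns count, (min, max), mean, variance, skewness and kurtosis.
summary = scipy.stats.describe(data['data1'])
print(summary.mean, summary.variance, summary.minmax)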

Andy Hayden