
Let's say I have a large data set that I can manipulate for some sort of analysis, such as looking at the values as a probability distribution.

Now that I have this large data set, I want to compare known, actual data to it; primarily, how many of the values in my data set share the same value or property as the known data. For example:

[figure: example cumulative distribution with simulated data (continuous lines), predicted percentages (fainter lines), and observational data (stars)]

This is a cumulative distribution. The continuous lines come from data generated in simulations, the decreasing intensities are just predicted percentages, and the stars are observational (known) data plotted against the generated data.

Another example I have made shows how the points could be projected visually onto a histogram:

[figure: histogram of generated data with the known data points marked on it]

I'm having difficulty marking where the known data points fall within the generated data set and plotting them cumulatively alongside the distribution of the generated data.

If I were to try to retrieve the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):

def SameValue(SimData, DefData, uncert):
    # count how many simulated values fall within DefData +/- uncert
    numb = [(DefData - uncert) < i < (DefData + uncert) for i in SimData]
    return sum(numb)
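
For context, I would then call it once per known data point, something like this (the arrays and numbers below are made up purely to illustrate the intended call):

import numpy as np

SimData = np.random.rayleigh(size=1000)   # stand-in for my generated data
knownData = [0.77, 1.13, 2.15]            # stand-in for the observational values
uncert = 0.05

counts = [SameValue(SimData, v, uncert) for v in knownData]
print(counts)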

But I am having trouble accounting for the points that fall within the value ranges and then setting it all up so that I can plot it. Any idea on how to gather this data and project it onto a cumulative distribution?

iron2man
  • To whomever downvoted my post, would you elaborate on why so I can improve on whatever I am doing wrong? – iron2man Apr 01 '17 at 22:39

1 Answer


The question is pretty chaotic, with lots of irrelevant information while staying vague on the essential points. I will try to interpret it as best I can.

I think what you are after is the following: given a finite sample from an unknown distribution, what is the probability of obtaining a new sample at a fixed value?

I'm not sure there is a general answer to that; in any case, it would be a question for statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
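
Just to illustrate what such an assumption could look like, here is a minimal sketch (not part of the original question), assuming scipy is available and, purely for the example, a Rayleigh shape:

import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.rayleigh(size=1000)

# assume a Rayleigh shape and fit its parameters to the sample
loc, scale = stats.rayleigh.fit(sample)

# density of the fitted model at a fixed value; multiplied by a small
# interval width dx it gives an approximate probability for that interval
v, dx = 0.77, 0.1
print(stats.rayleigh.pdf(v, loc=loc, scale=scale) * dx)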

For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.

So assume we have a sample x, which we divide into bins. We can compute the histogram h using numpy.histogram. The probability of finding a value in each bin is then given by h/h.sum().
Given a value v=0.77 whose probability according to the distribution we want to know, we can find the bin it belongs to by looking for the index ind in the bin array at which this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted; since that index refers to the right edge of the bin, the bin containing the value is ind-1.

import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())

ind = np.searchsorted(bins, 0.77, side="right")
# searchsorted gives the index of the bin's right edge,
# so the bin containing the value is ind-1
print(prob[ind-1])

So, for this sample, the probability of drawing a value in the bin containing 0.77 is around 5-6%.
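
As a side note (this variant is not from the original answer), numpy.digitize returns the same bin index and may read a bit more directly:

import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())

# digitize also returns the index of the bin's right edge, so subtract 1
ind = np.digitize(0.77, bins)
print(prob[ind-1])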

A different option would be to interpolate the histogram between the bin centers so as to find the probability.
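
A minimal sketch of that interpolation idea, reusing the x, bins and prob from above (0.77 is again just the example value):

import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
prob = h/float(h.sum())

# linearly interpolate the per-bin probability between the bin centers
centers = 0.5*(bins[:-1] + bins[1:])
print(np.interp(0.77, centers, prob))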

In the code below, we plot a distribution similar to the one from the picture in the question and use both methods: the first for the frequency histogram, the second for the cumulative distribution.

import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt

x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0,4,41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h)/float(np.cumsum(h).max())

points = [[.77,-.55],[1.13,1.08],[2.15,-.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']
colors = ["k", "crimson" , "gold"]
labels = list("ABC")

kws = dict(height_ratios=[1,1,2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6,6), gridspec_kw=kws, sharex=True)

# x-positions for the cumulative curve: the bin centers, padded with the
# outermost bin edges so the curve spans the full range
cbins = np.zeros(len(bins)+1)
cbins[1:-1] = bins[1:]-np.diff(bins[:2])[0]/2.
cbins[-1] = bins[-1]
# y-values: 0 at the left edge, 1 at the right edge, hcum at the bin centers
hcumc = np.linspace(0,1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k")
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x,y, s=10, alpha=0.7)

for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # plot points in scatter distribution
    ax.plot(p[0],p[1], **kw)
    #plot points in bar histogram, find bin in which to plot point
    # shift by half the bin width to plot it in the middle of bar
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix-1]+np.diff(bins[:2])[0]/2., h[pix-1]/2., **kw)
    # plot in cumulative histogram, interpolate, such that point is on curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0],yi, **kw)
ax.legend()
plt.tight_layout()  
plt.show()

[figure: resulting plot, frequency histogram (top), cumulative distribution (middle), and scatter of the sample (bottom), with the three points A, B, C marked in each panel]

ImportanceOfBeingErnest
  • Thank you very much for taking the time to give a concise answer. I'm going to see if I can make this work with my data and see where I can go from there. – iron2man Apr 02 '17 at 17:18