1

I am working on Windows. I just want to input an array and get the cdf of the array.

Bruce
  • Do you have matplotlib and numpy/scipy available? – Björn Pollex May 27 '11 at 08:49
  • @Bruce: Also, what do you mean by *cdf of the array*? A cdf has to be calculated with respect to a certain distribution. – Björn Pollex May 27 '11 at 09:25
  • @Space_C0wb0y: We can calculate probabilities assuming X contains outcome of an experiment. e.g. X = [1,3,4,5,7,8,10]. P(X < 1) = 0, P(X<5) = 3/7, P(X<11) = 1 etc. So we can plot P(X) for some known points in the range(X). – Bruce May 28 '11 at 12:02
  • @Bruce: What about `P(X<6)`? Is your data always discrete? Could you perform a density estimate, e.g. using `scipy.stats.gaussian_kde`? – Björn Pollex May 28 '11 at 13:04
  • @Space_C0wb0y: 4/7, yes my data is always discrete. – Bruce May 28 '11 at 16:56
  • @Bruce: I modified my answer, it should solve your problem. – Björn Pollex May 28 '11 at 17:22
  • possible duplicate of [How to plot empirical cdf in matplotlib in Python?](http://stackoverflow.com/questions/3209362/how-to-plot-empirical-cdf-in-matplotlib-in-python) – Dave Feb 04 '15 at 15:33
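The definition worked out in the comments, P(X < x) as the fraction of observations strictly below x, can be checked with a minimal sketch like the one below. The helper name `prob_less_than` is hypothetical (it does not come from the thread), and the data is the example list from the comment above.

```python
from bisect import bisect_left
from fractions import Fraction


def prob_less_than(sorted_data, x):
    """Return P(X < x): the fraction of observations strictly below x."""
    # bisect_left gives the number of elements that are < x in a sorted list
    return Fraction(bisect_left(sorted_data, x), len(sorted_data))


X = [1, 3, 4, 5, 7, 8, 10]  # example data from the comments, already sorted

for x in (1, 5, 6, 11):
    print(f"P(X < {x}) = {prob_less_than(X, x)}")
# P(X < 1) = 0
# P(X < 5) = 3/7
# P(X < 6) = 4/7
# P(X < 11) = 1
```

`Fraction` is used only so the output matches the hand-computed values exactly; plain division would work just as well for plotting.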

2 Answers

5

First, you could implement the CDF like this:

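from bisect import bisect_left

class discrete_cdf:
    def __init__(self, data):
        # bisect only works on sorted input, so sort a copy up front
        self._data = sorted(data)
        self._data_len = float(len(self._data))

    def __call__(self, point):
        # Fraction of observations strictly less than `point`, i.e. P(X < point)
        return (len(self._data[:bisect_left(self._data, point)]) /
                self._data_len)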

Using the above class, you can plot it like this:

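import matplotlib.pyplot as plt

# your_data: your list of non-negative integer observations
cdf = discrete_cdf(your_data)
# Go one step past the maximum so the curve reaches 1 (P(X < max + 1) == 1)
xvalues = range(0, max(your_data) + 2)
yvalues = [cdf(point) for point in xvalues]
plt.plot(xvalues, yvalues)
plt.show()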

Edit: Using `arange` with a fractional step doesn't make sense there; for integer-valued data, the CDF is the same for all points between x and x+1.
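Since an empirical CDF only changes at the observed values, one alternative to stepping over a `range` is to evaluate it at the distinct data points themselves, which also works for non-integer data. The sketch below is an illustration, not part of the answer above: `ecdf_steps` is a hypothetical helper, and it uses the conventional P(X <= x) rather than the strict P(X < x) from the class.

```python
from collections import Counter


def ecdf_steps(data):
    """Return (xs, ys): the distinct values and P(X <= x) at each of them."""
    counts = Counter(data)
    n = len(data)
    xs = sorted(counts)
    ys = []
    running = 0
    for x in xs:
        running += counts[x]  # observations less than or equal to x
        ys.append(running / n)
    return xs, ys


xs, ys = ecdf_steps([0.5, 1.25, 1.25, 3.0])
print(xs)  # [0.5, 1.25, 3.0]
print(ys)  # [0.25, 0.75, 1.0]
```

If matplotlib is available, `plt.step(xs, ys, where="post")` should draw these points as a staircase, which reflects the jumps better than a line plot.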

Björn Pollex
  • Nice solution. You can simplify `len(self._data[:bisect_left(self._data, point)])` to `bisect_left(self._data, point)`, since the slice already specifies the length. Perhaps `bisect_right` would also be better, since the CDF is for points where P(X <= x). – Lars Yencken Mar 14 '12 at 06:08
  • You should keep in mind that self is also included in a function call, therefore when you call "discrete_cdf(your_data)" you are actually doing "discrete_cdf(self, your_data)". So it is better to include "self" in your functions "def __init__(self, data):" – Raein Hashemi Jul 27 '18 at 17:10
  • @RaeinHashemi It is not just better, but in fact required - the code as shown was incorrect, thanks for pointing it out! – Björn Pollex Jul 29 '18 at 14:45
  • The class gives me a syntax error - something missing between `return (len(self._data[:bisect_left(self._data, point)])` and `self._data_len)` in the last two lines of the class? – ru111 Mar 12 '19 at 17:53
  • @ru111 Yes, there should be a division (`/`) there, this got lost in some edit. Thanks for pointing that out! – Björn Pollex Mar 14 '19 at 12:24
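The two suggestions from the first comment (dropping the slice and using `bisect_right` for P(X <= x)) could be combined roughly as follows. This is a sketch under those assumptions rather than the answer's own code; the factory name `ecdf` is hypothetical.

```python
from bisect import bisect_right


def ecdf(data):
    """Build F(x) = P(X <= x) from a sample; sorts a copy of the input."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right returns the number of observations that are <= x
        return bisect_right(ordered, x) / n

    return F


F = ecdf([10, 1, 3, 4, 5, 7, 8])  # unsorted input is fine
print(F(0), F(5), F(6), F(10))
```

With this convention F(5) counts the observation equal to 5, so it gives 4/7 where the strict version gives 3/7.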
2

Is this what you're after? I have provided a function that approximates the CDF by numerically integrating an array of PDF values (trapezoidal rule), and plotted the result. This assumes you want to input a PDF array of y-values.

import matplotlib.pyplot as plt
from math import exp

xmin=0
xmax=5
steps=1000
stepsize=float(xmax-xmin)/float(steps)
# Equally spaced sample points in [xmin, xmax)
xpoints=[i*stepsize for i in range(int(xmin/stepsize),int(xmax/stepsize))]
print(xpoints,int(xmin/stepsize),int(xmax/stepsize))

# Example PDF: exp(-x); a list is needed because get_cdf indexes into it
ypoints=[exp(-x) for x in xpoints]

def get_cdf(pdf_array):
    # Cumulative trapezoidal integration; relies on the global stepsize
    ans=[0]
    for i in range(0,len(pdf_array)-1):
        ans.append(ans[i]+(pdf_array[i]+pdf_array[i+1])/2.0*stepsize)
    return ans

cdfypoints=get_cdf(ypoints)

plt.plot(xpoints,ypoints)
plt.plot(xpoints,cdfypoints)
plt.show()
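One way to sanity-check the trapezoidal approach is to compare it against a case with a known answer: for the PDF exp(-x) on x >= 0, the exact CDF is 1 - exp(-x). The sketch below does that without plotting. The name `cumulative_trapezoid` is a hypothetical helper written for this illustration, and `accumulate(..., initial=...)` assumes Python 3.8 or newer.

```python
from itertools import accumulate
from math import exp


def cumulative_trapezoid(pdf_values, dx):
    """Running trapezoidal integral of equally spaced samples, starting at 0."""
    # Area of each trapezoid between neighbouring samples
    areas = [(a + b) / 2.0 * dx for a, b in zip(pdf_values, pdf_values[1:])]
    return list(accumulate(areas, initial=0.0))


dx = 5.0 / 1000
xs = [i * dx for i in range(1000)]
pdf = [exp(-x) for x in xs]
cdf = cumulative_trapezoid(pdf, dx)

# Largest deviation from the exact CDF of exp(-x), which is 1 - exp(-x)
max_error = max(abs(c - (1.0 - exp(-x))) for c, x in zip(cdf, xs))
print(max_error)
```

Passing `dx` explicitly, instead of reading a global step size, keeps the helper reusable for other grids.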


Rusty Rob