How to get the cumulative distribution function with NumPy?

Question

I want to create a CDF with NumPy, my code is the next:

histo = np.zeros(4096, dtype = np.int32)
for x in range(0, width):
   for y in range(0, height):
      histo[data[x][y]] += 1
      q = 0 
   cdf = list()
   for i in histo:
      q = q + i
      cdf.append(q)

I am walking by the array but take a long time the program execution. There is a built function with this feature, isn't?

Also see [empirical distribution function](https://en.wikipedia.org/wiki/Empirical_distribution_function) — djvg, Dec 07 '22 at 14:14

score 102 · Answer 1 · edited Jan 03 '16 at 19:33

Using a histogram is one solution but it involves binning the data. This is not necessary for plotting a CDF of empirical data. Let F(x) be the count of how many entries are less than x then it goes up by one, exactly where we see a measurement. Thus, if we sort our samples then at each point we increment the count by one (or the fraction by 1/N) and plot one against the other we will see the "exact" (i.e. un-binned) empirical CDF.

A following code sample demonstrates the method

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, normed = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

It outputs the following

enter image description here

as per numpy.histogram docs : __normed__ is equivalent to the __density__ argument, but produces incorrect results for unequal bin widths. Changed in version 1.15.0: DeprecationWarnings are actually emitted. — Oliver Prislan, Oct 14 '21 at 19:39

score 26 · Accepted Answer · answered May 17 '12 at 19:15

26

I'm not really sure what your code is doing, but if you have hist and bin_edges arrays returned by numpy.histogram you can use numpy.cumsum to generate a cumulative sum of the histogram contents.

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), normed=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444,  0.25555556,  0.36666667,  0.46666667,  0.56666667,
        0.71111111,  0.85555556,  0.94444444,  0.97777778,  1.11111111])

answered May 17 '12 at 19:15

user545424

15,713
11
56
70

17

However, this introduces a binning step that would not be necessary for a cumulative distribution. – hans_meine Feb 20 '14 at 09:32
3

"This keyword, `normed` is deprecated in Numpy 1.6 due to confusing/buggy behavior. It will be removed in Numpy 2.0. "There is a bug in the code if bin is not in `[0,1]`. Add x=np.cumsum(hist); x=(x - x.min()) / x.ptp() – ArtificiallyIntelligence Jun 23 '16 at 19:29
@hans_meine Exactly. Any better solution for this? – a06e Sep 15 '22 at 21:21
1

@becko Dan's reply above contains both a histogram-based and an "exact" solution ("method 2"). – hans_meine Sep 17 '22 at 11:23
Ah yes I missed that. Thanks @hans_meine – a06e Sep 18 '22 at 15:51

offwhitelotus · Answer 3 · 2014-11-21T18:59:23.627

update for numpy version 1.9.0. user545424's answer does not work in 1.9.0. This works:

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist = array([ 0.16666667,  0.15555556,  0.15555556,  0.05555556,  0.08888889,
    0.08888889,  0.07777778,  0.04444444,  0.18888889,  0.08888889])
>>> hist
array([ 0.1       ,  0.11111111,  0.11111111,  0.08888889,  0.08888889,
    0.15555556,  0.11111111,  0.13333333,  0.1       ,  0.11111111])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.09,  0.1 ,  0.1 ,  0.08,  0.08,  0.14,  0.1 ,  0.12,  0.09,  0.1 ])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.15,  0.29,  0.43,  0.48,  0.56,  0.64,  0.71,  0.75,  0.92,  1.  ])
>>>

user12287, I feel weird editing other people's answers. Besides, different answers for different versions. — offwhitelotus, Dec 30 '14 at 20:59

score 5 · Answer 4 · edited Aug 20 '21 at 09:32

5

To complement Dan's solution. In the case where there are several identical values in your sample, you can use numpy.unique :

Z = np.array([1,1,1,2,2,4,5,6,6,6,7,8,8])
X, F = np.unique(Z, return_index=True)
F=F/X.size

plt.plot(X, F)

edited Aug 20 '21 at 09:32

jtlz2

7,700
9
64
114

answered Aug 26 '15 at 15:08

Alex

59
1
1

1

That gives you values of `F` that are greater than 1. Perhaps you meant to use `F = F / float(F.max())` (also bear in mind that integer division would cause problems for people using Python 2x). – ali_m Aug 26 '15 at 16:19
This answer is old, thank you for your comments and answers. I have seen in each answer my rudimentary approach from three years ago. – omar Aug 26 '15 at 17:35
@Alex this is not quite correct since it should go up by more than 1/N for the entries which are there more than once. You're correct my solution will only be correct for the last one of such occurances but it will plot correctly. – Dan Aug 26 '15 at 19:20
In principle you are using counts, but python uses zero based indexing in F, so perhaps you meant `(F + 1) / (F[-1] + 1)` – Jthorpe Oct 08 '20 at 22:42

score 2 · Answer 5 · answered Feb 23 '23 at 14:06

The existing answers either resort to using a histogram, or don't handle duplicate values nicely/correctly (either ignoring duplicate values, or yielding a CDF that contains multiple y-values for the same x-value). I suggest the following method:

x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)

score -3 · Answer 6 · answered Sep 21 '16 at 16:55

-3

I am not sure if there is a ready-made answer, the exact thing to do is to define a function like:

def _cdf(x,data):
    return(sum(x>data))

This will be pretty fast.

answered Sep 21 '16 at 16:55

user1505725

107
1

How to get the cumulative distribution function with NumPy?

6 Answers6

Linked

Related