Empirical Distribution Function in Numpy

Question

I have the following list of values:

x = [-0.04124324405924407, 0, 0.005249724476788287, 0.03599351958245578, -0.00252785423151014, 0.01007584102031178, -0.002510349639322063,...]

and I want to calculate the empirical density function, so I think I need to calculate the empirical cumulative distribution function and I've used this code:

counts = np.asarray(np.bincount(x), dtype=float)
cdf = counts.cumsum() / counts.sum()

and then I calculate this value:

print cdf[0.01007584102031178]

and I always get 1 so I guess I made a mistake. Do you know how to fix it? Thanks!

score 8 · Answer 1 · answered Apr 01 '16 at 12:01

The usual definition of the empirical cdf at a value v is the number of observations less than or equal to v divided by the total number of observations. With 1-D NumPy arrays this is x[x <= v].size / x.size (float division; in Python 2 you need from __future__ import division):

import numpy as np

x = np.array([-0.04124324405924407,  0,
               0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014,  0.01007584102031178,
              -0.002510349639322063])
v = 0.01007584102031178
print(x[x <= v].size / x.size)

This prints 0.857142857143 (the exact value of the empirical cdf at 0.01007584102031178 is 6 / 7).

This is quite expensive if your array is large and you need to compute the cdf for several values. In such cases you can keep a sorted copy of your data and use np.searchsorted() to find out the number of observations <= v:

def ecdf(x):
    x = np.sort(x)
    def result(v):
        return np.searchsorted(x, v, side='right') / x.size
    return result

cdf = ecdf(x)
print(cdf(v))
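
As a rough sketch of how the closure above might be used for several query points in one call: np.searchsorted also accepts an array for v, so the returned function can be evaluated on many values at once. The names queries, values and direct are made up for this illustration, and the snippet assumes Python 3 division.

```python
import numpy as np

def ecdf(x):
    # Sort once; every later query is a binary search on the sorted copy.
    x = np.sort(x)
    def result(v):
        # side='right' counts the observations <= v (ties included).
        return np.searchsorted(x, v, side='right') / x.size
    return result

x = np.array([-0.04124324405924407,  0,
               0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014,  0.01007584102031178,
              -0.002510349639322063])

cdf = ecdf(x)

# Hypothetical query points, chosen only for illustration.
queries = np.array([-0.05, 0.0, 0.01007584102031178, 0.04])
values = cdf(queries)  # one ECDF value per query point

# Cross-check against the direct definition: fraction of observations <= q.
direct = np.array([(x <= q).mean() for q in queries])
print(values)
print(np.allclose(values, direct))
```

The sort is paid once when ecdf(x) is called; each later lookup is a binary search instead of a full pass over the data.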

At least in Python 2.7 I need to convert to a float, e.g. `return np.searchsorted(x, v, side='right') / float(x.size)`, otherwise the return statement is an integer / integer, and so returns either 0 or 1. — ashman, Jan 23 '19 at 22:36
@MikeWojnowicz You shouldn't have to if you use `from __future__ import division` as advised in the answer. — Stop harming Monica, Jan 24 '19 at 09:20

score 2 · Answer 2 · answered Apr 01 '16 at 10:56

There are two things going wrong here:

np.bincount only makes sense on an array of integers. It creates a histogram of the array values, rounded to an integer. For a more sophisticated histogram, use np.histogram. It works on floats, and you can explicitly set the number of bins or the bin edges, as well as the normalization.

Additionally, cdf denotes a normal numpy array in your case. The array indices can only be integers, so your query cdf[0.01007584102031178] is rounded down to cdf[0].

So in total, your code first counts the integers (the values are all rounded to 0), so your normalized cdf is just cdf == [ 1. ]. Then your index gets rounded down, so you query cdf[0], which is 1.
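
A minimal sketch of the np.histogram route, assuming a binned approximation of the cdf is acceptable. The choice bins=10 is arbitrary and only for illustration; the result is coarser than the exact empirical cdf because values are grouped into bins.

```python
import numpy as np

x = np.array([-0.04124324405924407,  0,
               0.005249724476788287, 0.03599351958245578,
              -0.00252785423151014,  0.01007584102031178,
              -0.002510349639322063])

# np.histogram accepts floats and returns two arrays:
# the per-bin counts and the bin edges (one more edge than bins).
counts, edges = np.histogram(x, bins=10)  # bins=10 is an arbitrary choice

# Cumulative fraction of observations falling in bins 0..i.
cdf = np.cumsum(counts) / counts.sum()

print(edges)
print(cdf)
```

Both return values are already NumPy arrays, so no extra np.asarray call is needed. Here cdf[i] belongs to the bin between edges[i] and edges[i + 1], so it cannot be indexed with a data value directly.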

Thank you very much. Should I do this: counts = np.asarray(np.histogram(x))? I'm not very good with this kind of method... — Angelina, Apr 01 '16 at 11:13
No, you don't have to cast NumPy results to an array, they are already NumPy arrays themselves. — jojonas, Apr 01 '16 at 14:11
