The problem
I have a dataset with 4 numeric features and 1000 datapoints. The distribution of the values is unknown (numpy's randint generates uniform ints, but this is just for illustration). Given a new datapoint (4 numbers), I want to find the cumulative probability (a single number; see the short sketch after the example data below) of this specific datapoint.
import numpy as np
data = np.random.randint(1, 100, size=(1000, 4))
array([[28, 52, 91, 66],
[78, 94, 95, 12],
[60, 63, 43, 37],
...,
[81, 68, 45, 46],
[14, 38, 91, 46],
[37, 51, 68, 97]])
new_data = np.random.randint(1, 100, size=(1, 4))
array([[75, 24, 39, 94]])
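To be explicit about what I mean: I am assuming the usual multivariate CDF, i.e. P(X1 <= 75, X2 <= 24, X3 <= 39, X4 <= 94) for the new point above. The plain empirical counterpart computed directly on the raw data would be something like:
# Empirical multivariate CDF: the fraction of datapoints that are
# componentwise <= the new point (no smoothing, just counting).
empirical_cdf = np.mean(np.all(data <= new_data[0], axis=1))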
I've tried:
Scipy
Can estimate the pdf, but I do not know how to estimate the cumulative probability from it. Possible ways are a Monte Carlo simulation or numerical integration (scipy.integrate.nquad), which is too slow for my case (see Integrate 2D kernel density estimate). A sketch of both follows the snippet below.
import scipy.stats
kde = scipy.stats.gaussian_kde(data.T)  # gaussian_kde expects shape (n_features, n_samples)
kde.pdf(new_data)                       # density at the new point, not a cumulative probability
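For reference, a minimal sketch of the two approaches I mean, assuming the cumulative probability is the joint P(X1 <= x1, ..., X4 <= x4) as above (the sample size and the integration lower bound are arbitrary choices on my part):
from scipy import integrate
# Monte Carlo: sample from the fitted KDE and count the fraction of
# samples that fall componentwise below the new point.
samples = kde.resample(100_000)                        # shape (4, 100000)
cdf_mc = np.mean(np.all(samples.T <= new_data[0], axis=1))
# Numerical integration: integrate the KDE pdf over the box
# [data minimum, new point] in each dimension. Using data.min() instead
# of -inf ignores the small tail mass below the data range, and this is
# the variant that is far too slow for me in 4 dimensions.
lower = data.min(axis=0)
cdf_int, _ = integrate.nquad(
    lambda x0, x1, x2, x3: kde.pdf([x0, x1, x2, x3])[0],
    [(lower[i], new_data[0][i]) for i in range(4)],
)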
Scikit-learn
Same as above: I do not know how to estimate the cumulative probability from the fitted density. A Monte Carlo sketch follows the snippet below.
from sklearn.neighbors import KernelDensity
model = KernelDensity()                 # Gaussian kernel with default bandwidth
model.fit(data)                         # expects shape (n_samples, n_features)
np.exp(model.score_samples(new_data))   # score_samples returns the log-density
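The same Monte Carlo idea should presumably work here as well, since KernelDensity can draw samples; a sketch (sample size again arbitrary):
# Sample from the fitted sklearn KDE and estimate the CDF as the
# fraction of samples that fall componentwise below the new point.
samples = model.sample(100_000, random_state=0)        # shape (100000, 4)
cdf_mc = np.mean(np.all(samples <= new_data[0], axis=1))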
Statsmodels
Could not achieve anything here, as ECDF only accepts 1-D data (a per-dimension sketch follows the snippet below).
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(data[:, 0])   # ECDF of the first feature only
ecdf(new_data[0][0])      # marginal cumulative probability, not the joint one
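The only way I see to combine the per-dimension ECDFs is to multiply them, which is only valid under the strong (and here unverified) assumption that the four features are independent:
# Product of marginal ECDFs -- equals the joint CDF only if the
# features are mutually independent, which real data may violate.
cdf_indep = np.prod([ECDF(data[:, i])(new_data[0][i]) for i in range(4)])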
The question is: is there a fast and efficient way to estimate the cumulative probability of a 4-dimensional datapoint using the scipy or sklearn (preferably) models above?
Am I moving in the right direction, or is there a completely different way to solve this? Maybe variational autoencoders are the way to go? Are there simple ways to solve this?