The problem
I have a dataset with 4 numeric features and 1000 datapoints. The distribution of the values is unknown (numpy's randint generates uniform ints, but this is just for illustration). Given a new datapoint (4 numbers), I want to find the cumulative probability (a single number; see the short sketch after the example data below) of this specific datapoint.
import numpy as np
data = np.random.randint(1, 100, size=(1000, 4))
array([[28, 52, 91, 66],
[78, 94, 95, 12],
[60, 63, 43, 37],
...,
[81, 68, 45, 46],
[14, 38, 91, 46],
[37, 51, 68, 97]])
new_data = np.random.randint(1, 100, size=(1, 4))
array([[75, 24, 39, 94]])
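To be explicit about what I mean: I am assuming the usual multivariate CDF, i.e. P(X1 <= 75, X2 <= 24, X3 <= 39, X4 <= 94) for the new point above. The plain empirical counterpart computed directly on the raw data would be something like:
# Empirical multivariate CDF: the fraction of datapoints that are
# componentwise <= the new point (no smoothing, just counting).
empirical_cdf = np.mean(np.all(data <= new_data[0], axis=1))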
I've tried:
Scipy
Can estimate the pdf, but I do not know how to estimate the cumulative probability from it. Possible ways are a Monte Carlo simulation or numerical integration (scipy.integrate.nquad), which is too slow for my case (see Integrate 2D kernel density estimate). A sketch of both follows the snippet below.
import scipy.stats
kde = scipy.stats.gaussian_kde(data.T)  # gaussian_kde expects shape (n_features, n_samples)
kde.pdf(new_data)                       # density at the new point, not a cumulative probability
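For reference, a minimal sketch of the two approaches I mean, assuming the cumulative probability is the joint P(X1 <= x1, ..., X4 <= x4) as above (the sample size and the integration lower bound are arbitrary choices on my part):
from scipy import integrate
# Monte Carlo: sample from the fitted KDE and count the fraction of
# samples that fall componentwise below the new point.
samples = kde.resample(100_000)                        # shape (4, 100000)
cdf_mc = np.mean(np.all(samples.T <= new_data[0], axis=1))
# Numerical integration: integrate the KDE pdf over the box
# [data minimum, new point] in each dimension. Using data.min() instead
# of -inf ignores the small tail mass below the data range, and this is
# the variant that is far too slow for me in 4 dimensions.
lower = data.min(axis=0)
cdf_int, _ = integrate.nquad(
    lambda x0, x1, x2, x3: kde.pdf([x0, x1, x2, x3])[0],
    [(lower[i], new_data[0][i]) for i in range(4)],
)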
Scikit-learn
Same as above: I do not know how to estimate the cumulative probability from the fitted density. A Monte Carlo sketch follows the snippet below.
from sklearn.neighbors import KernelDensity
model = KernelDensity()                 # Gaussian kernel with default bandwidth
model.fit(data)                         # expects shape (n_samples, n_features)
np.exp(model.score_samples(new_data))   # score_samples returns the log-density
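The same Monte Carlo idea should presumably work here as well, since KernelDensity can draw samples; a sketch (sample size again arbitrary):
# Sample from the fitted sklearn KDE and estimate the CDF as the
# fraction of samples that fall componentwise below the new point.
samples = model.sample(100_000, random_state=0)        # shape (100000, 4)
cdf_mc = np.mean(np.all(samples <= new_data[0], axis=1))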
Statsmodels
Could not achieve anything here, as ECDF only accepts 1-D data (a per-dimension sketch follows the snippet below).
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(data[:, 0])   # ECDF of the first feature only
ecdf(new_data[0][0])      # marginal cumulative probability, not the joint one
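The only way I see to combine the per-dimension ECDFs is to multiply them, which is only valid under the strong (and here unverified) assumption that the four features are independent:
# Product of marginal ECDFs -- equals the joint CDF only if the
# features are mutually independent, which real data may violate.
cdf_indep = np.prod([ECDF(data[:, i])(new_data[0][i]) for i in range(4)])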
The question is: is there a fast and efficient way to estimate the cumulative probability of a 4-dimensional datapoint using the scipy or sklearn (preferably) models above?
Am I moving in the right direction, or is there a completely different way to solve this? Maybe variational autoencoders are the way to go? Are there simple ways to solve this?