I am trying to come up with a way to determine the "best fit" between the following distributions:
Gaussian, Multinomial, Bernoulli
.
I have a large pandas df
, where each column can be thought of as a distribution of numbers. What I am trying to do, is for each column, determine the distribution of the above list as the best fit
.
I noticed this question which asks something familiar, but these all look like discrete distribution tests, not continuous. I know scipy has metrics for a lot of these, but I can't determine how to to properly place the inputs. My thought would be:
- For each column, save the data in a temporary
np array
- Generate
Gaussian, Multinomial, Bernoulli
distributions, perform aSSE
test to determine the distribution that gives the "best fit", and move on to the next column.
An example dataset (arbitrary, my dataset is 29888 x 73231
) could be:
| could | couldnt | coupl | cours | death | develop | dialogu | differ | direct | director | done |
|:-----:|:-------:|:-----:|:-----:|:-----:|:-------:|:-------:|:------:|:------:|:--------:|:----:|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 2 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 2 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 0 | 3 |
| 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 2 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
I have some basic code now, which was edited from this question, which attempts this:
import warnings
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels as sm
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')
# Create models from data
def best_fit_distribution(data, bins=200, ax=None):
"""Model data by finding best fit distribution to data"""
# Get histogram of original data
y, x = np.histogram(data, bins=bins, density=True)
x = (x + np.roll(x, -1))[:-1] / 2.0
# Distributions to check
DISTRIBUTIONS = [
st.norm, st.multinomial, st.bernoulli
]
# Best holders
best_distribution = st.norm
best_params = (0.0, 1.0)
best_sse = np.inf
# Estimate distribution parameters from data
for distribution in DISTRIBUTIONS:
# Try to fit the distribution
try:
# Ignore warnings from data that can't be fit
with warnings.catch_warnings():
warnings.filterwarnings('ignore')
# fit dist to data
params = distribution.fit(data)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
# Calculate fitted PDF and error with fit in distribution
pdf = distribution.pdf(x, loc=loc, scale=scale, *arg)
sse = np.sum(np.power(y - pdf, 2.0))
# if axis pass in add to plot
try:
if ax:
pd.Series(pdf, x).plot(ax=ax)
end
except Exception:
pass
# identify if this distribution is better
if best_sse > sse > 0:
best_distribution = distribution
best_params = params
best_sse = sse
except Exception:
print("Error on: {}".format(distribution))
pass
#print("Distribution: {} | SSE: {}".format(distribution, sse))
return best_distribution.name, best_sse
for col in df.columns:
nm, pm = best_fit_distribution(df[col])
print(nm)
print(pm)
However, I get:
Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
norm
(4.4, 7.002856560004639)
My expected output would be something like, for each column:
Gaussian SSE: <val> | Multinomial SSE: <val> | Bernoulli SSE: <val>
UPDATE Catching the error yields:
Error on: <scipy.stats._multivariate.multinomial_gen object at 0x000002E3CCFA9F40>
'multinomial_gen' object has no attribute 'fit'
Error on: <scipy.stats._discrete_distns.bernoulli_gen object at 0x000002E3CCEF4040>
'bernoulli_gen' object has no attribute 'fit'
Why am I getting errors? I think it is because multinomial
and bernoulli
do not have fit
methods. How can I make a fit method, and integrate that to get the SSE?? The target output of this function or program would be, for a
Gaussian, Multinomial, Bernoulli' distributions, what is the average SSE, per column in the df
, for each distribution type (to try and determine best-fit by column).
UPDATE 06/15: I have added a bounty.
UPDATE 06/16:
The larger intention, as this is a piece of a larger application, is to discern, over the course of a very large dataframe, what the most common distribution of tfidf values is. Then, based on that, apply a Naive Bayes classifier from sklearn that matches that most-common distribution. scikit-learn.org/stable/modules/naive_bayes.html contains details on the different classifiers. Therefore, what I need to know, is which distribution is the best fit across my entire dataframe, which I assumed to mean, which was the most common amongst the distribution of tfidf values in my words. From there, I will know which type of classifier to apply to my dataframe. In the example above, there is a column not shown called class
which is a positive
or negative
classification. I am not looking for input to this, I am simply following the instructions I have been given by my lead.