I am trying to find out what is the distribution of values of purchases done by consumers. It is Zero Inflated, as most of the consumers do not make any purchase in a given time constrain. I use python. As it is the value of items bought, my data is not desecrated like in Poisson distribution, but always nonnegative and continuous, which might mean log-normal, exponential, gamma, inverse gamma, etc. distributions
My question boils down to how to fit distributions to Zero Inflated data and check which fits better?
I have found quite a lot of information on how to make Zero Inflated Poisson regression, but my aim is to find out what is the underlying distribution of the process, not to make predictions, as I want to know the variance.
What are unknown:
- Probability of inflated zeros - not all zeros are inflated, as they might also be a result of the underlying distribution
- What is the family of distribution generating the purchase values
- What are the parameters of distribution generating the purchase values
I have created a sample code to generate example data and my attempt to fit two distributions. Unfortunately, SSE for the true one is higher than for the alternative.
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import matplotlib.pyplot as plt
N = 1000 * 1000
p_of_inflated_zeros = 0.20
#generation of data
Data = pd.DataFrame({"Prob_bought" : np.random.uniform(0, 1, N) })
Data["If_bought"] = np.where(Data["Prob_bought"] > p_of_inflated_zeros , 1 , 0)
Data["Hipotetical_purchase_value"] = scipy.stats.expon.rvs(scale = 50, loc = 10, size = N)
#Data["Hipotetical_purchase_value"] = scipy.stats.lognorm.rvs(s = 1, scale = 50, loc = 10, size = N)
Data["Hipotetical_purchase_value"] = np.where(Data["Hipotetical_purchase_value"] < 0 ,0 , Data["Hipotetical_purchase_value"])
Data["Purchase_value"] = Data["If_bought"] * Data["Hipotetical_purchase_value"]
# fit distribiution
# based on https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
#create
#x = np.linspace(min(gr_df_trans_tmp), max(gr_df_trans_tmp), 200)
y, x = np.histogram(Data["Purchase_value"], bins = 1000, density = True)
x = (x + np.roll(x, -1))[:-1] / 2.0
#lognormal
FIT_lognorm_sape, FIT_lognorm_loc, FIT_lognorm_scale = scipy.stats.lognorm.fit(Data["Purchase_value"])
FIT_lognorm_pdf = scipy.stats.lognorm.pdf(x, s = FIT_lognorm_sape, loc = FIT_lognorm_loc, scale = FIT_lognorm_scale)
SSE_lognorm = np.sum(np.power(y - FIT_lognorm_pdf, 2.0))
print(SSE_lognorm)
# 0.036408827144038584
#exponental
FIT_expo_loc, FIT_expo_scale = scipy.stats.expon.fit(Data["Purchase_value"])
FIT_expo_pdf = scipy.stats.expon.pdf(x, FIT_expo_loc, FIT_expo_scale)
SSE_expo = np.sum(np.power(y - FIT_expo_pdf, 2.0))
print(SSE_expo)
# 0.07564960702319487
# chart
# wykres histogram
axes = plt.gca()
axes.set_xlim([-2, 200])
plt.hist(Data["Purchase_value"], bins = 1000, alpha = 1, density = True)
# Plot the PDFs
plt.plot(x, FIT_lognorm_pdf, 'k', linewidth = 1, alpha = 0.5, color = 'red', label = 'lognormal')
plt.plot(x, FIT_expo_pdf, 'k', linewidth = 1, alpha = 0.5, color = 'blue', label = 'exponental')
plt.legend(loc='upper right', title = "")
plt.title("Fitting distribiution to ilustrativ data")
plt.xlabel("Hipotetical purchase value")
plt.ylabel('Density')