1

I know that are plenty of questions about the log-normal in scipy as this, this, this, and this But I still have doubts.

I'm trying to reproduce this example with SciPy, because I can understand the steps, but I'm not able to.

the data is:

from scipy.stats import lognorm, norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = [20, 22, 25, 30, 60]
fig,ax = plt.subplots(1,1)
sns.kdeplot(x, color='blue',fill=False,ax=ax)

And I want to fit a log-normal:

shape_x, loc_x, scale_x = lognorm.fit(x,floc=0)
print(f'Estimated parameters for log-normal distribution of parameter x:')
print(f'Shape (s) of x: {shape_x}')
print(f'Location (loc) of x: {loc_x}')
print(f'Scale (scale) of x: {scale_x}')

accordingly to other questions in StackOverflow and scipy documentation, the mean and standard deviation should be:

mu_x = np.log(scale_x)
sigma_x = shape_x
print(f'Mean (μ) of x: {mu_x}')
print(f'Standard deviation (σ) of x: {sigma_x}')

Next, I try to create synthetic data with those parameters, to check:

synthetic_data_B = np.random.lognormal(mean=mu_x, sigma=sigma_x, size=len(x))
pdf_x = lognorm.pdf(x, s = shape_x, loc=loc_x, scale=scale_x)

fig,ax = plt.subplots(1,1)
sns.kdeplot(x, color='blue',fill=False,ax=ax)
sns.kdeplot(synthetic_data_B, color='red',fill=False,ax=ax)
ax.plot(x,pdf_x,color='green')

enter image description here

What I realize it:

  • The median in the article is the scale parameter from scipy.
  • The μ in the article is my mu_x = np.log(scale_x), but the σ is different, in the article is 0.437 and with scipy gives 0.391.
  • If I evaluate the mean with lognorm.mean(shape_x,loc_x,scale_x) it gives quite similar to the article.
  • If I evaluate the standard deviation with lognorm.std(shape_x,loc_x,scale_x), it gives different.

My questions are:

  • Why σ is different?

  • The synthetic data predicted with the fitted parameters do not match the original data, why?

  • If I try to do the opposite and to recover the x distribution from the fitted parameters, what I got is fairway from what should be.

  • How can I generate synthetic data to represent the real x?

Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64
JCV
  • 447
  • 1
  • 5
  • 15

1 Answers1

1

Problem is, what is returned from scipy fit does not include Bessel correction.

Very easy to check

import numpy as np
from scipy.stats import lognorm

x = [20, 22, 25, 30, 60]
shape_x, loc_x, scale_x = lognorm.fit(x,floc=0)
print(f'Estimated parameters for log-normal distribution of parameter x:')
print(f'Shape (s) of x: {shape_x}')
print(f'Location (loc) of x: {loc_x}')
print(f'Scale (scale) of x: {scale_x}')

mu_x = np.log(scale_x)
sigma_x = shape_x
print(f'Mean (μ) of x: {mu_x}')
print(f'Standard deviation (σ) of x: {sigma_x}')

lnx = np.log(x)
q = lnx-mu_x

t = np.sqrt(np.sum(q*q)/len(q))
print(t)

last line would print 0.3913832002383578, which is the same as scipy fit returns.

You could easily do backcheck with artificial samples:

 r = lognorm.rvs(sigma_x, loc=0.0, scale=scale_x, size=10000)
 shape_x, loc_x, scale_x = lognorm.fit(r, floc=0)
 print(shape_x, loc_x, scale_x)

for me it prints

0.3912912820809421 0 28.8544068486573

which is the same values as before (well, modulo statistical noise)

Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64