I have a data set containing values between 0 and 1e-5. I think the data can be described by a lognormal distribution, so I use scipy.stats.lognorm to fit it, and I want to plot the original data and the fitted distribution on the same figure using matplotlib.

First, I plotted the sample as a histogram.

Then I added the fitted distribution as a line plot. However, this changes the Y-axis range to very large values, so the original data (the sample) can no longer be seen on the figure!

I've checked all the variables and found that pdf_fitted is very large (>1e7). I really don't understand why a simple fit with scistats.lognorm.fit doesn't work on a sample that was generated from the same distribution with scistats.lognorm.pdf. Here is code that demonstrates my problem:

from matplotlib import pyplot as plt
from scipy import stats as scistats
import numpy as np

# generate a sample for x between 0 and 1e-5
x = np.linspace(0, 1e-5, num=1000)
y = scistats.lognorm.pdf(x, 3, loc=0, scale=np.exp(10))
h = plt.hist(y, bins=40) # plot the sample by histogram
# plt.show()

# fit the sample by using Log Normal distribution
param = scistats.lognorm.fit(y)
print("Log-normal distribution parameters : ", param)
pdf_fitted = scistats.lognorm.pdf(
    x, *param[:-2], loc=param[-2], scale=param[-1])
plt.plot(x, pdf_fitted, label="Fitted Lognormal distribution")
plt.ticklabel_format(style='sci', scilimits=(-3, 4), axis='x')
plt.legend()
plt.show()
Cœur
yoursbh
  • You forgot to specify the x-values of the `plot(x_values,y_values)` – ImportanceOfBeingErnest Dec 18 '18 at 18:20
  • matplotlib has [documentation](https://matplotlib.org/gallery/api/two_scales.html) for plotting with different scales – G. Anderson Dec 18 '18 at 19:16
  • Hi, that only changes the X-axis in the figure. But I still don't understand why `pdf_fitted` is so large ... – yoursbh Dec 18 '18 at 22:15
  • Your whole approach is wrong. Your histogram isn't a histogram of data drawn from a lognormal distribution, and you're fitting to garbage data, which produces a garbage fit. There's an example of the correct way to get what you want using your current parameters in my answer below. – tel Dec 18 '18 at 22:17

1 Answer

The problem

The immediate problem is that your fit is really, really bad. You can see this if you set both axes of the plot to a log scale with plt.xscale('log') and plt.yscale('log'), which puts your histogram and your fitted curve on a single plot. The fit is off by many orders of magnitude in both directions.
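As for why the fitted values can exceed 1e7 at all: a pdf is a density, so its values are not capped at 1. The area under the curve has to equal 1, which means a distribution squeezed into a very narrow range of x must be very tall. The sketch below illustrates this with a hypothetical lognormal whose scale is 1e-8 (these parameters are chosen for illustration only and are not the ones from the question's fit).

```python
import numpy as np
from scipy import stats as scistats

# Hypothetical parameters, chosen only so that the distribution lives near 1e-8.
narrow = scistats.lognorm(1.0, loc=0, scale=1e-8)

# Log-spaced grid covering essentially all of the probability mass.
x = np.logspace(-12, -4, 20001)
pdf = narrow.pdf(x)

# Trapezoidal rule written out by hand: area under the pdf curve.
area = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))

print("largest pdf value:", pdf.max())
print("area under the pdf:", area)
```

The peak is far above 1 while the area stays close to 1, so a huge `pdf_fitted` is not wrong in itself; it only tells you where the fitted distribution has put its mass.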

The fix

Your whole approach to generating a sample from the probability distribution represented by stats.lognorm and fitting it was wrong. Here's a correct way to do it, using the same loc and scale for the lognorm distribution that you supplied in your question (the shape parameter below is 0.1 rather than your 3):

from matplotlib import pyplot as plt
from scipy import stats as scistats
import numpy as np

plt.figure(figsize=(12,7))
realparam = [.1, 0, np.exp(10)]

# generate pdf data around the median (for loc=0, the median equals scale)
m = realparam[2]
x = np.linspace(m*.6, m*1.4, num=10000)
y = scistats.lognorm.pdf(x, *realparam)

# generate a matching random sample
sample = scistats.lognorm.rvs(*realparam, size=100000)
# plot the sample by histogram
h = plt.hist(sample, bins=100, density=True)

# fit the sample by using Log Normal distribution
param = scistats.lognorm.fit(sample)
print("Log-normal distribution parameters : ", param)
pdf_fitted = scistats.lognorm.pdf(x, *param)
plt.plot(x, pdf_fitted, lw=5, label="Fitted Lognormal distribution")
plt.legend()
plt.show()

Output:

Log-normal distribution parameters :  (0.09916091013245995, -215.9562383088556, 22245.970148671593)
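Note that the fitted loc in the output above is about -216 rather than 0, because `fit` estimates all three parameters by default. If you know that your data start at zero, one option is to hold loc fixed with the `floc` keyword of `fit`. The following is a minimal sketch under that assumption; the seed and sample size are arbitrary choices made for reproducibility.

```python
import numpy as np
from scipy import stats as scistats

# Same parameters as in the answer's example: shape, loc, scale.
true_s, true_loc, true_scale = 0.1, 0, np.exp(10)

# Draw a reproducible random sample (seed and size are arbitrary).
sample = scistats.lognorm.rvs(
    true_s, loc=true_loc, scale=true_scale, size=20000, random_state=0)

# Default fit: shape, loc and scale are all free.
s_free, loc_free, scale_free = scistats.lognorm.fit(sample)

# Constrained fit: loc is held at 0, only shape and scale are estimated.
s_fixed, loc_fixed, scale_fixed = scistats.lognorm.fit(sample, floc=0)

print("free loc  :", (s_free, loc_free, scale_free))
print("fixed loc :", (s_fixed, loc_fixed, scale_fixed))
```

With loc fixed, the remaining two estimates should land close to the true shape and scale. Whether fixing loc is appropriate depends on your data, so treat it as a modelling decision rather than a default.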

tel
  • So, does this mean that I can't use a lognormal to fit a sample that has small values? But I really do generate the sample from the lognormal distribution with `scistats.lognorm.pdf`, so I don't understand why the inverse operation, `scistats.lognorm.fit`, doesn't work. – yoursbh Dec 18 '18 at 22:23
  • You seem to have a fundamental misunderstanding of what a pdf is. A pdf is not a sample. Taking a small slice of the pdf (which is how you're getting your `y` data) is not the same thing as taking a sample from the pdf. Taking the integral of the pdf over a given region tells you the probability that a sample will be drawn from that region. That is what a pdf is. In order to take samples from the distributions in `scipy.stats`, you have to use their `rvs` method. – tel Dec 18 '18 at 22:46
  • Hi tel, thanks for your clear answer. I've marked your post as the solution and added more details about my true confusion to my original post *_* – yoursbh Dec 19 '18 at 22:12
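The distinction drawn in the comments can be checked numerically. The sketch below integrates the pdf over an interval and compares the result with the fraction of `rvs` draws that fall inside the same interval. The interval bounds, seed, and sample size are arbitrary values picked for illustration.

```python
import numpy as np
from scipy import integrate
from scipy import stats as scistats

# Frozen lognormal with the parameters used in the answer.
dist = scistats.lognorm(0.1, loc=0, scale=np.exp(10))

# An arbitrary interval near the bulk of the distribution.
a, b = 20000.0, 24000.0

# Probability of the interval as the integral of the pdf over [a, b].
integral, _ = integrate.quad(dist.pdf, a, b)

# The same probability from the cdf, as a cross-check.
prob = dist.cdf(b) - dist.cdf(a)

# Fraction of actual random draws that land inside the interval.
sample = dist.rvs(size=200000, random_state=0)
frac = np.mean((sample >= a) & (sample <= b))

print("integral of pdf :", integral)
print("cdf difference  :", prob)
print("sample fraction :", frac)
```

The three numbers should agree closely. The values returned by `dist.pdf(x)` on a grid are heights of the density curve, not draws from the distribution, which is why passing them to `fit` produces a meaningless result.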