
Using scipy, I'd like to get a measure of how likely it is that a random variable was generated by my log-normal distribution.

To do this I've considered looking at how far it is from the maximum of the PDF.

My approach so far is this: if the variable is r = 1.5 and the distribution has σ = 0.5, find the density from the PDF, lognorm.pdf(r, 0.5, loc=0). Given the result (0.38286...), I would then like to look up what area of the PDF is below 0.38286...
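A minimal, self-contained sketch of that step (assuming scipy's default scale=1, which the 0.38286 value in the question implies):

from scipy.stats import lognorm

r = 1.5      # the observed value
sigma = 0.5  # shape parameter of the log-normal

density = lognorm.pdf(r, sigma, loc=0)  # ~0.38286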

How can this last step be implemented? Is this even the right way to approach this problem?

To give a more general example of the problem: say someone tells me they have 126 followers on Twitter. I know that Twitter follower counts follow a log-normal distribution, and I have the PDF of that distribution. Given that distribution, how do I determine how believable this number of followers is?
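For concreteness, a sketch of how that number could be placed within an assumed distribution (the sigma and scale values here are hypothetical, not fitted from real Twitter data):

from scipy.stats import lognorm

# Hypothetical parameters; real ones would come from fitting follower
# data, e.g. lognorm.fit(data, floc=0)
sigma, scale = 2.0, 100.0

followers = 126
quantile = lognorm.cdf(followers, sigma, loc=0, scale=scale)
# quantile says where 126 falls within the assumed distribution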

lyschoening
  • Have you seen [this question](http://stackoverflow.com/questions/8870982/how-do-i-get-a-lognormal-distribution-in-python-with-mu-and-sigma)? (It might help.) – Andy Hayden Sep 20 '12 at 09:47
  • To be honest I don't quite understand what that question is about. I have no problem getting an accurate result from a PDF, but want to ask what fraction of the PDF would get that value or lower. – lyschoening Sep 20 '12 at 09:57
  • What you are looking for is precisely the CDF, see my answer. :) (To be honest I didn't really understand that question either, but it did help!) – Andy Hayden Sep 20 '12 at 10:00

2 Answers


The area under the PDF up to r is the CDF (which is conveniently a method in lognorm):

lognorm.cdf(r, 0.5, loc=0)

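For example, evaluating this at the r = 1.5 from the question, and at a low value with roughly the same PDF (these ~0.79 and ~0.03 figures also come up in the comments below; values approximate):

from scipy.stats import lognorm

lognorm.cdf(1.5, 0.5, loc=0)    # ~0.79 (area below r = 1.5)
lognorm.cdf(0.405, 0.5, loc=0)  # ~0.03 (a low outlier with roughly the same PDF value)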

One thing you can use this to calculate is the Folded Cumulative Distribution (mentioned here), also known as a "mountain plot":

FCD = 0.5 - abs(lognorm.cdf(r, 0.5, loc=0) - 0.5)
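For example, under the σ = 0.5 distribution from the question, the fold measures both tails on the same scale (values approximate):

from scipy.stats import lognorm

0.5 - abs(lognorm.cdf(1.5, 0.5, loc=0) - 0.5)    # ~0.209 (distance into the upper tail)
0.5 - abs(lognorm.cdf(0.405, 0.5, loc=0) - 0.5)  # ~0.035 (distance into the lower tail)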
Andy Hayden
  • I know about the CDF, but it doesn't solve my problem. Say the PDF of r=1.5 is ~0.38, the CDF is ~0.79. The PDF of r=0.405 is also ~0.38, but the CDF is ~0.03. I want a distribution that gives both low and high outliers the same value. – lyschoening Sep 20 '12 at 10:04
  • @lyschoening could you take `abs(CDF-0.5)`? – Andy Hayden Sep 20 '12 at 10:25
  • I could take the `[CDF at the low side] + (1 - [CDF at the high side])`, so `0.03 + (1-0.79) = 0.24` in this example. – lyschoening Sep 20 '12 at 10:34
  • @lyschoening Neither of these things have the *meaning* of likelihood "a random variable was generated by my log-normal distribution". To be honest, I don't think there is an answer to that particular question (besides 0, which presumably you wouldn't consider an answer). – Andy Hayden Sep 20 '12 at 10:56
  • I feared something like that. I guess I have to rephrase the problem I'm trying to solve. – lyschoening Sep 20 '12 at 11:01
  • @lyschoening The answer will almost surely be 0! :) – Andy Hayden Sep 20 '12 at 11:04
  • I had another look at your `abs(CDF-0.5)` suggestion. I was too quick to discard it. This here does what I want: `1-abs(cdf(r)-0.5)*2` – lyschoening Sep 20 '12 at 11:38
  • @lyschoening I was going to suggest just that! Is this the Folded cumulative Distribution, I'll update my answer to mention this. – Andy Hayden Sep 20 '12 at 11:47

This gives the same result as Hayden's answer.

For statistical tests with an asymmetric distribution, we get the p-value by taking the minimum of the two tail probabilities:

>>> r = 1.5
>>> 0.5 - abs(lognorm.cdf(r, 0.5, loc=0) - 0.5) 
0.20870287338447135
>>> min((lognorm.cdf(r, 0.5), lognorm.sf(r, 0.5)))
0.20870287338447135

This is usually doubled to get the two-sided p-value, but there are some recent papers that suggest alternatives to the doubling.
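For example, the doubled, two-sided value for the same r (note this equals the 1 - abs(cdf(r) - 0.5) * 2 expression arrived at in the comments above):

>>> 2 * min(lognorm.cdf(r, 0.5), lognorm.sf(r, 0.5))
0.4174057467689427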

Josef