
I'm using Jeff Alstott's Python powerlaw package to try to fit a power law to my data. The package is based on the paper by Clauset et al. on power-law distributions in empirical data.

First, some details on my data:

  1. It is discrete (word count data);
  2. It is heavily skewed to the right (high skewness);
  3. It is leptokurtic (excess kurtosis greater than 10).

What I have done so far

df_data is my DataFrame, where word_count is a Series containing word count data for around 1,000 word tokens.

First I've generated a fit object:

fit = powerlaw.Fit(data=df_data.word_count, discrete=True)

Next, I compare the power-law distribution for my data against other distributions (namely lognormal, exponential, lognormal_positive, stretched_exponential, and truncated_power_law) using the fit.distribution_compare(distribution_one, distribution_two) method.

As a result of the distribution_compare method, I've obtained the following (R, p) tuples for each of the comparisons:

  • fit.distribution_compare('power_law', 'lognormal') = (0.35617607052907196, 0.5346696007)
  • fit.distribution_compare('power_law', 'exponential') = (397.3832646921206, 5.3999952097178692e-06)
  • fit.distribution_compare('power_law', 'lognormal_positive') = (27.82736434863289, 4.2257378698322223e-07)
  • fit.distribution_compare('power_law', 'stretched_exponential') = (1.37624682020371, 0.2974292837452046)
  • fit.distribution_compare('power_law', 'truncated_power_law') = (-0.0038373682383605, 0.83159372694621)

From the powerlaw documentation:

R : float

The loglikelihood ratio of the two sets of likelihoods. If positive, the first set of likelihoods is more likely (and so the probability distribution that produced them is a better fit to the data). If negative, the reverse is true.

p : float

The significance of the sign of R. If below a critical value (typically .05) the sign of R is taken to be significant. If above the critical value the sign of R is taken to be due to statistical fluctuations.
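The decision rule in that excerpt can be written as a small helper; this is purely illustrative (the function below is mine, not part of the powerlaw package):

```python
def interpret_lrt(R, p, threshold=0.05):
    """Decision rule from the powerlaw docs: the sign of R is only
    meaningful when p falls below the significance threshold."""
    if p >= threshold:
        return 'inconclusive'  # sign of R may be a statistical fluctuation
    return 'first distribution favored' if R > 0 else 'second distribution favored'

# Applied to two of the comparisons above:
interpret_lrt(397.38, 5.4e-06)  # power_law vs exponential
interpret_lrt(0.356, 0.535)     # power_law vs lognormal
```

By this rule, only the exponential and lognormal_positive comparisons produce a significant result; the other three are inconclusive.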

From the comparison results between the power-law, exponential, and lognormal distributions, I feel inclined to say that my data follow a power-law distribution.

Would this be a correct interpretation/assumption about the test results? Or perhaps I'm missing something?

born to hula
1 Answer


First off, while the methods might have been developed by me, Cosma Shalizi, and Mark Newman, our implementation is in Matlab and R. The Python implementation I think you're using could be from Jeff Alstott, Javier del Molino Matamala, or maybe Joel Ornstein (all of these are available off my website).

Now, about the results. A likelihood ratio test (LRT) does not allow you to conclude that you do or do not have a power-law distribution. It's only a model comparison tool, meaning it evaluates whether the power law is a less terrible fit to your data than some alternative. (I phrase it that way because an LRT is not a goodness of fit method.) Hence, even if the power-law distribution is favored over all the alternatives, it doesn't mean your data are power-law distributed. It only means that the power-law model is a less terrible statistical model of the data than the alternatives are.

To evaluate whether the power-law distribution itself is a statistically plausible model, you should compute the p-value for the fitted power-law model, using the semi-parametric bootstrap we describe in our paper. If p>0.1, and the power-law model is favored over the alternatives by the LRT, then you can conclude relatively strong support for your data following a power-law distribution.
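The bootstrap idea can be sketched as follows. This is a simplified, continuous-case illustration, not the full procedure: it holds xmin and alpha fixed across replicates, whereas the method in the paper (and plpva) refits both on every synthetic sample, which generally yields a more conservative p-value:

```python
import numpy as np

def ks_distance(tail, xmin, alpha):
    """KS distance between tail data (x >= xmin) and the fitted
    continuous power-law CDF  P(X <= x) = 1 - (x / xmin)**(1 - alpha)."""
    x = np.sort(np.asarray(tail, dtype=float))
    n = len(x)
    model = 1.0 - (x / xmin) ** (1.0 - alpha)
    empirical = np.arange(1, n + 1) / n
    return np.abs(empirical - model).max()

def bootstrap_pvalue(data, xmin, alpha, n_boot=200, seed=0):
    """Fraction of synthetic samples whose KS distance to the model is
    at least as large as the observed one. Simplification: xmin and
    alpha are held fixed (the full method refits them per replicate)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    tail = data[data >= xmin]
    p_tail = len(tail) / len(data)
    d_obs = ks_distance(tail, xmin, alpha)
    exceed = 0
    for _ in range(n_boot):
        # Number of the n points landing in the tail for this replicate.
        n_tail = rng.binomial(len(data), p_tail)
        if n_tail == 0:
            continue
        # Inverse-CDF sampler for the continuous power law:
        # x = xmin * (1 - u)**(-1 / (alpha - 1))
        u = rng.random(n_tail)
        synth_tail = xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))
        if ks_distance(synth_tail, xmin, alpha) >= d_obs:
            exceed += 1
    return exceed / n_boot
```

If the returned p exceeds 0.1, the power law is a statistically plausible model of the tail by the convention used in the paper; for real use on discrete data, prefer the full implementation (plpva) over this sketch.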

Back to your specific results: each of your LRT comparisons produces a pair (R, p), where R is the normalized log-likelihood ratio and p is the statistical significance of that ratio. What the p-value tests here is whether the sign of R is meaningful. If p < 0.05 for an LRT, then a positive sign indicates the power-law model is favored. Looking at your results, I see that the exponential and lognormal_positive alternatives are worse fits to the data than the power-law model. However, the lognormal, stretched_exponential, and truncated_power_law comparisons are inconclusive, meaning those alternatives are just as terrible fits to the data as your power-law model.

Without the p-value from the hypothesis test for the power-law model itself, the LRT results are not fully interpretable. But even a partial interpretation is not consistent with a strong degree of evidence for a power-law pattern, since two non-power-law models are just as good (bad) as the power law for these data. The fact that the exponential model is genuinely worse than the power law is not surprising considering how right-skewed your data are, so nothing to write home about there.

aaronclauset
  • Hi @aaronclauset. Thank you so much for your comments - bit of an honor to have your feedback on my issue. For the sake of correctness I've updated the question. – born to hula Mar 15 '18 at 15:44
  • (cont.) Just to be on the same page. So even if the result from the hypothesis test for the power-law shows a p-value that is enough for rejecting the null hypothesis, the fact that the LRT is inconclusive for power-law versus some distributions would prevent me from stating that power-law would be a good fit with enough certainty. Is this assumption correct? Thanks in advance! – born to hula Mar 15 '18 at 16:00
  • Going a bit into further detail - considering the results for my LRT tests, and supposing that a KS test for power-law gives me p > 0.1, would I be able to conclude that I have at least moderate support for saying that power-law is a good fit for my distribution? – born to hula Mar 15 '18 at 16:04
  • Happy to help! If the hypothesis test for the power-law alone returns p>0.1, then it's okay to say that your data are plausibly power-law distributed. (The word "plausibly" is chosen on purpose, since it implies a little bit of empirical uncertainty.) But, even in that case, if the LRT says some non-power-law distributions are just as good a fit as the power law, then that weakens the case that your data are definitely power-law distributed. The reason is that lognormals and stretched exponentials can also make data that *look* like power laws. – aaronclauset Mar 15 '18 at 16:33
  • Thanks for the quick response Aaron! I've used Joel Ornstein's plpva.py library in order to calculate the p-value. As a result of running plpva I got p = 0.9 and gof = 0.003. As far as I understand, the null hypothesis for the KS test (which is implemented in plpva) is that the distributions are the same - the lower my p-value, the greater the evidence I would have to reject the null hypothesis and conclude the distributions are different. But would this result allow me to say my data is plausibly power-law distributed? – born to hula Mar 15 '18 at 17:06
  • If plpva returns p>0.1, then yes, by convention, you could say that the data in the upper tail (x>=xmin) are plausibly power-law distributed. That doesn't say that the power-law model is the *best* model of these data, but it does say that it is statistically plausible that they were drawn iid from the fitted power-law distribution. – aaronclauset Mar 20 '18 at 04:33
  • Great! Thank you so much! Cheers – born to hula Mar 20 '18 at 17:49