I'm using Jeff Alstott's Python powerlaw package to try to fit my data to a power law. The package is based on the paper by Clauset et al., which discusses fitting power-law distributions.
First, some details on my data:
- It is discrete (word count data);
- It is heavily skewed, with most of the mass at low counts and a long right tail (high positive skewness);
- It is leptokurtic (excess kurtosis is greater than 10).
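To be explicit about what I mean by skewness and excess kurtosis, here is a minimal pure-Python sketch, assuming the usual moment-based (population) definitions. The helper names and the toy counts are made up for illustration; they are not my real data.

```python
# Illustrative only: moment-based (population) skewness and excess kurtosis.
def central_moment(xs, k):
    """k-th central moment of the sample xs."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def skewness(xs):
    """Positive when the long tail is on the right."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Kurtosis minus 3; positive values mean leptokurtic."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


# Made-up toy counts: many small values and a couple of very large ones.
toy_counts = [1] * 50 + [2] * 20 + [5] * 5 + [40, 120]
print(skewness(toy_counts), excess_kurtosis(toy_counts))
```

Data shaped like this (a few very frequent words, many rare ones) gives a positive skewness and a positive excess kurtosis under these definitions.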
What I have done so far
df_data is my DataFrame, and word_count is a Series in it containing word count data for around 1000 word tokens.
First, I generated a fit object:
import powerlaw
fit = powerlaw.Fit(data=df_data.word_count, discrete=True)
Next, I compared the power-law fit against other candidate distributions (lognormal, exponential, lognormal_positive, stretched_exponential and truncated_power_law) using the fit.distribution_compare(distribution_one, distribution_two) method.
The distribution_compare calls returned the following (R, p) tuples:
- fit.distribution_compare('power_law', 'lognormal') = (0.35617607052907196, 0.5346696007)
- fit.distribution_compare('power_law', 'exponential') = (397.3832646921206, 5.3999952097178692e-06)
- fit.distribution_compare('power_law', 'lognormal_positive') = (27.82736434863289, 4.2257378698322223e-07)
- fit.distribution_compare('power_law', 'stretched_exponential') = (1.37624682020371, 0.2974292837452046)
- fit.distribution_compare('power_law', 'truncated_power_law') = (-0.0038373682383605, 0.83159372694621)
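For anyone wanting to reproduce this, the five calls above can be collected in one loop. This is only a sketch: compare_all and ALTERNATIVES are hypothetical names of my own, not part of powerlaw, and the helper only assumes that the object passed in exposes distribution_compare(name_one, name_two) returning an (R, p) tuple, as the fit object does.

```python
# Hypothetical helper (not part of powerlaw): run the same comparison
# against several alternative distributions and collect the results.
def compare_all(fit, alternatives, baseline="power_law"):
    """Return {alternative: (R, p)} for baseline vs. each alternative.

    `fit` is assumed to behave like a powerlaw.Fit object, i.e. to have a
    distribution_compare(name_one, name_two) method returning (R, p).
    """
    results = {}
    for alt in alternatives:
        R, p = fit.distribution_compare(baseline, alt)
        results[alt] = (R, p)
    return results


# The alternatives I compared against, by the names used in the calls above.
ALTERNATIVES = [
    "lognormal",
    "exponential",
    "lognormal_positive",
    "stretched_exponential",
    "truncated_power_law",
]
```

With the fit object from earlier, compare_all(fit, ALTERNATIVES) would return a dictionary keyed by the alternative's name.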
From the powerlaw documentation:
R : float
The loglikelihood ratio of the two sets of likelihoods. If positive, the first set of likelihoods is more likely (and so the probability distribution that produced them is a better fit to the data). If negative, the reverse is true.
p : float
The significance of the sign of R. If below a critical value (typically .05) the sign of R is taken to be significant. If above the critical value the sign of R is taken to be due to statistical fluctuations.
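This is how I read that rule, written out as a small sketch. The interpret function and the 0.05 threshold are my own assumptions for illustration (the documentation only says "typically .05"); the (R, p) values are the ones listed above.

```python
# Hypothetical helper: apply the documented reading of (R, p) mechanically.
def interpret(R, p, alpha=0.05):
    """Return which distribution the comparison favours, if any.

    'first'        : R > 0 and p < alpha (first distribution favoured)
    'second'       : R < 0 and p < alpha (second distribution favoured)
    'inconclusive' : p >= alpha (or R == 0), so the sign of R is not significant
    """
    if p >= alpha or R == 0:
        return "inconclusive"
    return "first" if R > 0 else "second"


# (R, p) tuples copied from my results, keyed by the second distribution.
results = {
    "lognormal": (0.35617607052907196, 0.5346696007),
    "exponential": (397.3832646921206, 5.3999952097178692e-06),
    "lognormal_positive": (27.82736434863289, 4.2257378698322223e-07),
    "stretched_exponential": (1.37624682020371, 0.2974292837452046),
    "truncated_power_law": (-0.0038373682383605, 0.83159372694621),
}

labels = {name: interpret(R, p) for name, (R, p) in results.items()}
for name, label in labels.items():
    print(f"power_law vs {name}: {label}")
```

Here "first" always refers to power_law, since it is the first argument in every comparison.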
Based on the comparisons of the power law against the exponential and lognormal distributions, I'm inclined to say that my data follow a power-law distribution.
Would this be a correct interpretation/assumption about the test results? Or perhaps I'm missing something?