I am currently trying to use PyMC for determining the parameters of a power law fit for given data. I am using the pdf formula taken from:
A. Clauset, C. R. Shalizi, and M. E. J. Newman, "Power-law distributions in empirical data," Siam rev., vol. 51, iss. 4, pp. 661-703, 2009.
For generating sample data with specific given parameters for testing my code I use the following Python powerlaw package that implements the methods of Clauset et al.:
https://pypi.python.org/pypi/powerlaw
My code works pretty well if I use a fixed xmin value (i.e., the lower boundary for which the power law function holds). However, as soon as I try to also determine the xmin value, the output produces too high values for xmin. I have commented the corresponding xmin parts out:
test = powerlaw.Power_Law(xmin = 1., parameters = [1.5])
simulated = test.generate_random(1000)
fit = powerlaw.Fit(simulated, xmin=1.)
print fit.alpha
print fit.xmin
xmin = 1.
#alpha = mc.Uniform('alpha', 0,6, value=1.5)
alpha = mc.Exponential('alpha', 1.5)
#xmin = mc.Uniform('xmin', min(simulated), max(simulated), value=min(simulated))
#xmin = mc.Exponential('xmin', 1.)
#print xmin.value
@mc.stochastic(observed=True)
def power_law(value=simulated, alpha=alpha, xmin=xmin):
#value = value[value >= xmin]
return np.sum(np.log((alpha-1) * xmin**(alpha-1) * value**-alpha))
model = mc.MCMC([alpha,xmin,power_law])
model.sample(iter=5000)
print(model.stats()['alpha']['mean'])
#print(model.stats()['xmin']['mean'])
alpha_samples = model.trace('alpha')[:]
#xmin_samples = model.trace('xmin')[:]
figsize(12.5,10)
ax = plt.subplot(311)
ax.set_autoscaley_on(False)
plt.hist(alpha_samples, histtype='stepfilled', bins=20, label="posterior of alpha", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.xlim([0,2])
plt.xlabel("alpha value")
#plt.subplot(312)
#plt.hist(xmin_samples, histtype='stepfilled', bins=20,
# label="posterior of xmin", color="#A60628", normed=True)
#plt.legend(loc="upper left")
#plt.xlim([0,500])
#plt.xlabel("xmin value")
I think one problem is that I should always only consider the data >= xmin in my power_law function. If I do so, I get the "correct" alpha values when I also determine xmin, but still xmin is way too high. I also have a feeling that this is an unfair comparison then, as the data samples to look at in the MCMC process would be differently sized and hence, also the likelihood comparison are biased.
Maybe someone has an idea of how I could cope with this issue.
UPDATE: My current code is available here: http://www.philippsinger.info/notebooks/pl_pymc.html