2

I am currently trying to use PyMC for determining the parameters of a power law fit for given data. I am using the pdf formula taken from:

A. Clauset, C. R. Shalizi, and M. E. J. Newman, "Power-law distributions in empirical data," Siam rev., vol. 51, iss. 4, pp. 661-703, 2009.

For generating sample data with specific given parameters for testing my code I use the following Python powerlaw package that implements the methods of Clauset et al.:

https://pypi.python.org/pypi/powerlaw

My code works pretty well if I use a fixed xmin value (i.e., the lower boundary for which the power law function holds). However, as soon as I try to also determine the xmin value, the output produces too high values for xmin. I have commented the corresponding xmin parts out:

test = powerlaw.Power_Law(xmin = 1., parameters = [1.5])
simulated = test.generate_random(1000)
fit = powerlaw.Fit(simulated, xmin=1.)
print fit.alpha
print fit.xmin

xmin = 1.

#alpha = mc.Uniform('alpha', 0,6, value=1.5) 
alpha = mc.Exponential('alpha', 1.5)
#xmin = mc.Uniform('xmin', min(simulated), max(simulated), value=min(simulated))
#xmin = mc.Exponential('xmin', 1.)
#print xmin.value

@mc.stochastic(observed=True)
def power_law(value=simulated, alpha=alpha, xmin=xmin):
    #value = value[value >= xmin]
    return np.sum(np.log((alpha-1) * xmin**(alpha-1) * value**-alpha))

model = mc.MCMC([alpha,xmin,power_law])
model.sample(iter=5000)

print(model.stats()['alpha']['mean'])
#print(model.stats()['xmin']['mean'])

alpha_samples = model.trace('alpha')[:]
#xmin_samples = model.trace('xmin')[:]

figsize(12.5,10)

ax = plt.subplot(311)
ax.set_autoscaley_on(False)

plt.hist(alpha_samples, histtype='stepfilled', bins=20, label="posterior of alpha", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.xlim([0,2])
plt.xlabel("alpha value")

#plt.subplot(312)

#plt.hist(xmin_samples, histtype='stepfilled', bins=20, 
#         label="posterior of xmin", color="#A60628", normed=True)
#plt.legend(loc="upper left")
#plt.xlim([0,500])
#plt.xlabel("xmin value")

enter image description here

I think one problem is that I should always only consider the data >= xmin in my power_law function. If I do so, I get the "correct" alpha values when I also determine xmin, but still xmin is way too high. I also have a feeling that this is an unfair comparison then, as the data samples to look at in the MCMC process would be differently sized and hence, also the likelihood comparison are biased.

Maybe someone has an idea of how I could cope with this issue.

UPDATE: My current code is available here: http://www.philippsinger.info/notebooks/pl_pymc.html

fsociety
  • 1,791
  • 4
  • 22
  • 32

1 Answers1

1

I think that you are correct that there is a problem in your likelihood when xmin is less than some data values. My solution would be to explicitly prohibit this case, by returning a log-likelihood of -np.inf when it arises:

@mc.stochastic(observed=True)
def power_law(value=simulated, alpha=alpha, xmin=xmin):
    if value.min() < xmin:
        return -np.inf
    return np.sum(np.log((alpha-1) * xmin**(alpha-1) * value**-alpha))

I also suggest using a burn-in period of half your total samples, and checking convergence graphically, as follows:

model.sample(iter=5000, burn=2500)
pm.Matplot.plot(model)

(See this SO answer for a PyMC3 example of how I like to use burn-in and graphical convergence checks.)

Community
  • 1
  • 1
Abraham D Flaxman
  • 2,969
  • 21
  • 39
  • Thanks for the conversion hint. I already figured this out myself. You misunderstood the problem with xmin though. What I mean is that with varying xmin values I have to look at varying data lengths. Do for the sequence [1,2,3] with xmin=1 vs xmin=2 I would look at distinct empirical data lengths. I commented my approach out in the code: `value = value[value >= xmin]`. The problem you point out might be problematic as well, I could solve this with uniform priors I think. – fsociety Jun 25 '14 at 17:41
  • I don't think your approach is a valid solution: when `xmin=2` you have data observations with probability 0 that you are ignoring. In this case, I think that returning `-np.inf` is the correct thing to do. As evidence, here is the posterior mean of `xmin` in my approach, compared to the powerlaw.Fit value: xmin: 1.0 vs 0.998160020389. This is [collected up in a notebook here](http://nbviewer.ipython.org/gist/aflaxman/d504a0862e7a4e337286). – Abraham D Flaxman Jun 25 '14 at 18:51
  • Okay, I think I get your approach now. It works really well also with higher `xmin` values. However, in this case you would always only consider potential `xmin` values below the minimum of the data which only can lead to the lowest value of the data to be the appropriate `xmin` value. This resembles the same result as if I would set `xmin` manually to the lowest value as I did before. The goal of power law fitting though is to get the most appropriate `xmin` value for which power law holds. This can be also some higher value. E.g., in `[1,2,3]` `xmin` could very well be 2. – fsociety Jun 25 '14 at 20:36
  • I think you will need a different model for that. The approach by Clauset, Shalizi, and Newman will always have `xmin <= data.min()`. – Abraham D Flaxman Jun 26 '14 at 14:26
  • That's not true. The Clauset approach is explicitly determined in finding the appropriate `xmin` value given empirical data. They use all possible `xmin` values, then limit the data to guarantee `xmin <= data.min()` and calculate the KS statistic for each `xmin`. The lowest distance then determines the appropriate `xmin` value. This is exactly the thing I want to achieve. – fsociety Jun 26 '14 at 14:48
  • Can you provide an example where `powerlaw.Fit` produces a different estimate of `xmin`? That will help me understand your claim. – Abraham D Flaxman Jun 30 '14 at 23:26
  • Sure. E.g., load the Moby Dick data `http://tuvalu.santafe.edu/~aaronc/powerlaws/data/words.txt` and do `powerlaw.Fit(data)`. You should get a best lower boundary of `xmin=7`. – fsociety Jul 01 '14 at 07:40
  • Thanks, this makes it clear that I am missing something in the CSN paper. I will investigate further when time permits. – Abraham D Flaxman Jul 03 '14 at 16:54