
I'm attempting to create a three-level logistic regression model in pymc3. There is a top level, mid level, and an individual level, where the mid-level coefficients are estimated from top-level coefficients. I'm having difficulty specifying the proper data structure for the mid level, however.

Here's my code:

with pm.Model() as model:
    # Hyperpriors
    top_level_tau = pm.HalfNormal('top_level_tau', sd=100.)
    mid_level_tau = pm.HalfNormal('mid_level_tau', sd=100.)    

    # Priors
    top_level = pm.Normal('top_level', mu=0., tau=top_level_tau, shape=k_top)
    mid_level = [pm.Normal('mid_level_{}'.format(j),
                           mu=top_level[mid_to_top_idx[j]],
                           tau=mid_level_tau)
                 for j in range(k_mid)]

    intercept = pm.Normal('intercept', mu=0., sd=100.)

    # Model prediction
    yhat = pm.invlogit(mid_level[mid_to_bot_idx] + intercept)

    # Likelihood
    yact = pm.Bernoulli('yact', p=yhat, observed=y)

I'm getting the error "only integer arrays with one element can be converted to an index" (on line 16), which I think is related to the fact that the mid_level variable is a list, not a proper pymc Container. (I don't see the Container class in the pymc3 source code, either.)
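The error can be reproduced outside of PyMC3; a plain Python list does not support NumPy's fancy indexing (the numbers below are placeholders):

```python
import numpy as np

coeffs = [1.0, 2.0, 3.0]      # a plain Python list, like mid_level above
idx = np.array([0, 2, 0])     # a multi-element integer index array

try:
    coeffs[idx]               # lists reject multi-element array indices
except TypeError as exc:
    print(exc)                # the "integer arrays ... index" TypeError

print(np.array(coeffs)[idx])  # ndarrays allow it: [1. 3. 1.]
```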

Any help would be appreciated.

Edit: Adding some mock data

y = np.array([0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
mid_to_bot_idx = np.array([0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 3, 2])
mid_to_top_idx = np.array([0, 0, 1, 1])
k_top = 2
k_mid = 4

Edit #2:

There seem to be a few different ways to solve this issue, although none are completely satisfactory:

1) One can reframe the model as:

with pm.Model() as model:
    # Hyperpriors
    top_level_tau = pm.HalfNormal('top_level_tau', sd=100.)
    mid_level_tau = pm.HalfNormal('mid_level_tau', sd=100.)    

    # Priors
    top_level = pm.Normal('top_level', mu=0., tau=top_level_tau, shape=k_top)
    mid_level = pm.Normal('mid_level', mu=0., tau=mid_level_tau, shape=k_mid)
    intercept = pm.Normal('intercept', mu=0., sd=100.)

    # Model prediction
    yhat = pm.invlogit(top_level[top_to_bot_idx] + mid_level[mid_to_bot_idx] + intercept)

    # Likelihood
    yact = pm.Bernoulli('yact', p=yhat, observed=y)

This seems to work, although I can't figure out how to extend it to the case where the mid-level variance is not constant for all of the mid-level groups.

2) One can wrap the mid-level coefficients into a Theano tensor using theano.tensor.stack: i.e.,

import theano.tensor as tt
mid_level = tt.stack([pm.Normal('mid_level_{}'.format(j),
                           mu=top_level[mid_to_top_idx[j]],
                           tau=mid_level_tau)
                 for j in range(k_mid)])

But this seems to run very slowly on my actual data set (30k observations), and it makes plotting inconvenient (each of the mid_level coefficients gets its own trace using pm.traceplot).

Anyway, some advice/input from the developers would be appreciated.

  • There's no predictors in your model, should it be `yhat = pm.invlogit(top_level[top_to_bot_idx] * x + mid_level[mid_to_bot_idx] * x + intercept)`? – maxymoo Dec 06 '17 at 23:11

3 Answers


Turns out the solution was simple: it appears that any distribution (like pm.Normal) can accept a vector of means as an argument, so replacing this line

mid_level = [pm.Normal('mid_level_{}'.format(j),
                       mu=top_level[mid_to_top_idx[j]],
                       tau=mid_level_tau)
             for j in range(k_mid)]

with this

mid_level = pm.Normal('mid_level',
                       mu=top_level[mid_to_top_idx],
                       tau=mid_level_tau,
                       shape=k_mid)

works. The same method can also be used to specify individual standard deviations for each of the mid-level groups.
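The fix can be illustrated without PyMC3 at all: fancy-indexing the top-level vector with mid_to_top_idx yields one prior mean per mid-level group in a single vectorized expression (the coefficient values here are made up):

```python
import numpy as np

top_level = np.array([0.5, -0.3])        # hypothetical top-level coefficients
mid_to_top_idx = np.array([0, 0, 1, 1])  # from the mock data in the question

# one prior mean per mid-level group, taken from its top-level parent
mid_means = top_level[mid_to_top_idx]
print(mid_means)  # [ 0.5  0.5 -0.3 -0.3]
```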

EDIT: Fixed typo


A few changes (note I changed yhat to theta):

theta = pm.Deterministic('theta', pm.invlogit(sum(mid_level[i] for i in mid_to_bot_idx) + intercept))
yact = pm.Bernoulli('yact', p=theta, observed=y)
  • I don't think this does what I want (although I could be mistaken). This seems to sum all of the coefficients to get a single value of theta for all observations. I want different thetas for each observation, depending on the top_level and the mid_level. In other words, theta should be a vector of the same length as y, not a scalar. For instance, see this model: http://nbviewer.jupyter.org/github/aflaxman/pymc-examples/blob/master/seeds_re_logistic_regression_pymc3.ipynb – vbox Nov 30 '16 at 20:52
  • @vbox Yes, I tend to agree with you. Your original code had mid_level array simply re-ordered (and some values duplicated) by the mid_to_bot_idx array. I implemented exactly as it was shown in your code. – Vladislavs Dovgalecs Nov 30 '16 at 21:35
  • Usually the argument to invlogit is something like (x * beta + intercept) where x are features, beta are regression coefficients and intercept is the bias term. – Vladislavs Dovgalecs Nov 30 '16 at 21:36

In your question you declared yhat explicitly. You can skip that step and pass the linear predictor straight to the logit_p parameter of Bernoulli.

Note: you can pass either p or logit_p.

In my case, using logit_p sped up sampling.

Code:

# Likelihood
yact = pm.Bernoulli('yact', logit_p=top_level[top_to_bot_idx] + mid_level[mid_to_bot_idx] + intercept, observed=y)
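The equivalence of the two parameterizations can be checked numerically; the hand-rolled logistic function below is just a sketch, not PyMC3's implementation:

```python
import math

def invlogit(x):
    # logistic function: maps log-odds to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# Bernoulli(p=invlogit(eta)) and Bernoulli(logit_p=eta) describe the same
# likelihood; logit_p just lets the library handle the transform internally.
eta = 0.7
p = invlogit(eta)
assert 0.0 < p < 1.0
assert abs(math.log(p / (1.0 - p)) - eta) < 1e-12  # logit inverts invlogit
```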