How to represent a categorical predictor rstan?

Question

What is the proper way to format a categorical predictor to use in STAN? I cannot seem to input a categorical predictor as a normal factor variable, so what is the quickest way to transform a normal categorical variable such that Stan can accept it?

For example, say I had a a continue predictor and a categorical predictor

your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))

Which looks like this:

     income      country
1  62085.59      England
2  60806.33      England
3  60527.27      England
4  67112.64          USA
5  57675.92          USA
6  58128.44          USA
7  60822.47 South Africa
8  55805.80 South Africa
9  63982.99 South Africa
10 64555.45      Belgium

How would I prepare this to be entered in rstan?

score 25 · Accepted Answer · edited Feb 18 '21 at 17:32

It is correct that Stan only inputs real or integeger variables. In this case, you want to convert a categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like

dummy_variables <- model.matrix(~ country, data = your_dataset)

Which will look like this

   (Intercept) countryEngland countrySouth Africa countryUSA
1            1              1                   0          0
2            1              1                   0          0
3            1              1                   0          0
4            1              0                   0          1
5            1              0                   0          1
6            1              0                   0          1
7            1              0                   1          0
8            1              0                   1          0
9            1              0                   1          0
10           1              0                   0          0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"

However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables. This approach can be taken a step farther by inputting the entire model formula like

X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)

Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N,K]  X;
  vector[N]    y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma); // likelihood
  // priors
}

Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.

Hi, this is an "indicator variable" approach. How would you do it for an "index variable"? Inicator variable: is k-1 dummy variables, index variable: one dummy variable with k levels — Lefty, Oct 16 '20 at 10:33
Don't we need to marginalize the categorical variables and use target += ... ? Some examples on the documentation say that. — skan, Oct 07 '22 at 12:21
Do we need to marginalize the categorical predictor variables? Or only if they are latent variables? https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/ — skan, Apr 23 '23 at 00:15

score 8 · Answer 2 · answered Oct 19 '20 at 19:20

8

Another approach is to use an index variable, in which case the Stan program would look like

data {
  int<lower = 1> N; // observations
  int<lower = 1> J; // levels
  int<lower = 1, upper = J> x[N];
  vector[N] y;      // outcomes
}
parameters {
  vector[J] beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(beta[x], sigma); // likelihood
  // priors 
}

and you would pass the data from R to Stan like

list(N = nrow(my_dataset),
     J = nlevels(my_dataset$x),
     x = as.integer(my_dataset$x),
     y = my_dataset$y)

answered Oct 19 '20 at 19:20

Ben Goodrich

4,870
1
19
18

The problem with this is that we lose the factor levels, the parameters will be called `beta[1]` etc. It'd be nice if there was a way to retain that (or at least conveniently re-introduce it when analysing the posterior samples etc). – RobinGower Jan 30 '23 at 12:50
Sorry for hijacking the conversation here, but how would this approach work if you also have a continuous predictor? I can't find a way to calculate the expected mean, given it doesn't accept a vector of slopes – LeoRJorge Feb 01 '23 at 15:51

How to represent a categorical predictor rstan?

2 Answers2