18

What is the proper way to format a categorical predictor to use in STAN? I cannot seem to input a categorical predictor as a normal factor variable, so what is the quickest way to transform a normal categorical variable such that Stan can accept it?

For example, say I had a a continue predictor and a categorical predictor

your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))

Which looks like this:

     income      country
1  62085.59      England
2  60806.33      England
3  60527.27      England
4  67112.64          USA
5  57675.92          USA
6  58128.44          USA
7  60822.47 South Africa
8  55805.80 South Africa
9  63982.99 South Africa
10 64555.45      Belgium

How would I prepare this to be entered in rstan?

Mario Becerra
  • 514
  • 1
  • 6
  • 16
goldisfine
  • 4,742
  • 11
  • 59
  • 83

2 Answers2

25

It is correct that Stan only inputs real or integeger variables. In this case, you want to convert a categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like

dummy_variables <- model.matrix(~ country, data = your_dataset)

Which will look like this

   (Intercept) countryEngland countrySouth Africa countryUSA
1            1              1                   0          0
2            1              1                   0          0
3            1              1                   0          0
4            1              0                   0          1
5            1              0                   0          1
6            1              0                   0          1
7            1              0                   1          0
8            1              0                   1          0
9            1              0                   1          0
10           1              0                   0          0
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$country
[1] "contr.treatment"

However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables. This approach can be taken a step farther by inputting the entire model formula like

X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)

Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N,K]  X;
  vector[N]    y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(X * beta, sigma); // likelihood
  // priors
}

Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.

Mario Becerra
  • 514
  • 1
  • 6
  • 16
Ben Goodrich
  • 4,870
  • 1
  • 19
  • 18
  • Hi, this is an "indicator variable" approach. How would you do it for an "index variable"? Inicator variable: is k-1 dummy variables, index variable: one dummy variable with k levels – Lefty Oct 16 '20 at 10:33
  • @Lefty : See the alternative answer. – Ben Goodrich Oct 19 '20 at 19:20
  • Don't we need to marginalize the categorical variables and use target += ... ? Some examples on the documentation say that. – skan Oct 07 '22 at 12:21
  • Do we need to marginalize the categorical predictor variables? Or only if they are latent variables? https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/ – skan Apr 23 '23 at 00:15
8

Another approach is to use an index variable, in which case the Stan program would look like

data {
  int<lower = 1> N; // observations
  int<lower = 1> J; // levels
  int<lower = 1, upper = J> x[N];
  vector[N] y;      // outcomes
}
parameters {
  vector[J] beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(beta[x], sigma); // likelihood
  // priors 
}

and you would pass the data from R to Stan like

list(N = nrow(my_dataset),
     J = nlevels(my_dataset$x),
     x = as.integer(my_dataset$x),
     y = my_dataset$y)
Ben Goodrich
  • 4,870
  • 1
  • 19
  • 18
  • The problem with this is that we lose the factor levels, the parameters will be called `beta[1]` etc. It'd be nice if there was a way to retain that (or at least conveniently re-introduce it when analysing the posterior samples etc). – RobinGower Jan 30 '23 at 12:50
  • Sorry for hijacking the conversation here, but how would this approach work if you also have a continuous predictor? I can't find a way to calculate the expected mean, given it doesn't accept a vector of slopes – LeoRJorge Feb 01 '23 at 15:51