
Suppose we want a simple 'descriptive model of income' with three groups: North, Central, and South (think US regions). Comparing otherwise-similar groups, average income in the North is 130, in the Central it's 80, and in the South it's 60. With equal group sizes, the overall mean is 90.

There should be a way, in a (linear regression) model, to report coefficients as differences from overall means (in a multivariate context, ‘all else equal’) and get one for each:

$\beta_{North} = 40$

$\beta_{Central} = -10$

$\beta_{South} = -30$

And skip the intercept, obviously.

This seems most intuitive to me. But I can’t for the life of me figure out how to get this with R’s ‘contrast coding’. (And also, that seems to mess up the variable names).

Set parameters for my simulation/mwe

library(tidyverse)  # tibble, dplyr and purrr are used below

m_inc <- 90   # overall mean income
b_n <- 40     # North deviation from the mean
b_c <- -10    # Central deviation
b_s <- -30    # South deviation

sd_prop <- 0.5 #sd as share of mean
pop_per <- 1000 

Simulated data


set.seed(100)

n_income <- rnorm(pop_per, m_inc + b_n, (m_inc + b_n)*sd_prop)
c_income <- rnorm(pop_per, m_inc + b_c, (m_inc + b_c)*sd_prop)
s_income <- rnorm(pop_per, m_inc + b_s, (m_inc + b_s)*sd_prop)

noise_var <- rnorm(pop_per*3, 0, (m_inc + b_s)*sd_prop)

i_df <- tibble(
  region = rep( c("n", "c", "s"), c(pop_per, pop_per, pop_per) ),
  income = c(n_income, c_income, s_income),
  noise_var
) %>% 
  mutate(region = as.factor(region))


i_df %>%                               # Summary by group using purrr
  split(.$region) %>%
  purrr::map(summary)

Looks close enough.
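
Equivalently, a quick check of the simulated group means, as a sketch using dplyr's group_by()/summarise() (loaded via tidyverse above):

i_df %>%
  group_by(region) %>%
  summarise(mean_income = mean(income), sd_income = sd(income))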

Now I want to 'model income' to examine the differences by region, 'controlling for other factors'. To illustrate the issue, let's make the south the base group. I also set the default contr.treatment explicitly, in case you have changed it in your session.


i_df <- i_df %>% mutate(region = relevel(region, ref = "s"))
options(contrasts = rep("contr.treatment", 2))


(
  basic_lm <- i_df %>% lm(income ~ region + noise_var, .)
)

The standard thing: the intercept is (approximately) the mean for the 'base group' (the south, about 60), and the coefficients regionc and regionn are the adjustments relative to that base, roughly +20 (80 - 60) and +70 (130 - 60).

This is standard 'dummy coding', i.e., 'treatment coding', the default in R.

We can change this default to something called 'sum contrast coding'; the contrasts option takes two entries, one for unordered and one for ordered factors, so the line below sets both:

options(contrasts = rep("contr.sum", 2))

(
  basic_lm_cc <- i_df %>% lm(income ~ region + noise_var, .)
)

Now this seems to get us the adjustment coefficients we are looking for, but

  1. The names of the regions are lost; how do I know which is which?
  2. It is apparently reporting the adjustment coefficients for s (south) and c (central). Not very intuitive.

This seems to be the case no matter how we relevel the region to set a particular base group (I tried it) ... the coefficients don't change.
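
(A partial fix for the labels, if I read things right: the level-to-column mapping can be read off the contrast matrix attached to the factor, and base R's dummy.coef() reports effects per level name.)

contrasts(i_df$region)   # columns of the sum coding, rows labelled by level
dummy.coef(basic_lm_cc)  # per-level effects, with the intercept reported separately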

I found a solution, but it's not the "right way". I de-mean the outcome (income) variable, and force a 0 intercept:


i_df %>%
  mutate(m_inc = mean(income)) %>%
  lm(income - m_inc ~ 0 + region + noise_var, .)

Yay! This is what I wanted, and miraculously the variable names are preserved. But it seems a weird way to do it. Also note that, with the above code, this set of coefficients appears no matter which contrast matrix I force, whether sum or treatment.

How can this be done the 'right way' with contrast coding or another tool?

daaronr

1 Answer


We can't get contrast coding to do this on its own. In one-way ANOVA, contrasts recode an N-level factor into N - 1 variables plus one intercept, so it is still a reparametrization from N variables to N variables. Including both the grand mean and the deviations of all N group means would be a reparametrization from N variables to N + 1 variables. Even if we found a way to specify it, the design matrix would be rank-deficient, and lm/aov/glm, etc., would set one coefficient to NA.
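
A minimal illustration of that rank deficiency, on hypothetical toy data (the names g, y and X exist only for this sketch):

set.seed(1)
g <- gl(3, 10)                             # 3 groups, 10 observations each
y <- rnorm(30, mean = c(130, 80, 60)[g])   # group means as in the question
X <- model.matrix(~ 0 + g)                 # one dummy column per level
lm(y ~ X)                                  # intercept + 3 dummies: one coefficient comes back NA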

In general, we have to do a follow-up statistical analysis. In this answer I summarize what contrast coding does and show how to compare each group mean with the grand mean, in four ways: coding it ourselves, and using the packages multcomp, emmeans and car.

library(ggplot2)
library(car)
library(multcomp)
library(emmeans)

Setup

I am going to use an example that is similar to yours.

The simulated data have a group effect on y plus a numerical covariate x (with true slope 1) whose range differs by group:

SimData <- function (group.size, group.mean, group.variance) {
  ## number of groups
  ng <- length(group.size)
  if (ng > 5) stop("There is no need to experiment with that many groups!")
  ## number of observations per group
  n <- sum(group.size)
  ## generate a factor variable 'f' for these groups
  f <- rep.int(factor(sprintf("G%d", 1:ng)), group.size)
  ## simulate samples from each group
  mu <- rep.int(group.mean, group.size)
  se <- rep.int(sqrt(group.variance), group.size)
  y <- rnorm(n, mu, se)
  ## numerical covariate 'x' with slope = 1
  lim <- sd(y)
  br <- seq.int(-lim, lim, length.out = ng + 1)
  interval <- cbind(br[-(ng + 1)], br[-1])
  interval <- interval[sample.int(ng), ]
  a <- rep.int(interval[, 1], group.size)
  b <- rep.int(interval[, 2], group.size)
  x <- runif(n, a, b)
  ## create data.frame
  data.frame(y = y + x, f = f, x = x)
}

set.seed(4891738)  ## my Stack Overflow ID
group.size <- c(100, 125, 150)
group.mean <- c(130, 80, 60)
group.variance <- 0.25 * group.mean
dat <- SimData(group.size, group.mean, group.variance)

ggplot(data = dat, mapping = aes(x = x, y = y, colour = f)) + geom_point()

[figure: scatter plot of y against x, coloured by group f]

A naive calculation of each group's mean and variance is misleading, because it ignores the covariate x, whose range differs systematically across groups in this simulation; the raw summaries are far from the truth!

## true values are 130, 80, 60
with(dat, tapply(y, f, mean))
##        G1        G2        G3 
## 148.36985  60.38273  59.45486

## true values are 32.5, 20 and 15
with(dat, tapply(y, f, var))
##       G1       G2       G3 
## 67.37108 55.25185 41.69867
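
Because the simulation adds x with slope 1, subtracting it recovers the group means almost exactly; this is just a sanity check that it is the covariate, not the group effect, that throws off the naive summaries:

## covariate-adjusted group means (close to 130, 80, 60)
with(dat, tapply(y - x, f, mean))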

Contrast Coding

The 'intuitive' form of the model has one mean per group; the 'actual' form that lm() fits has an intercept plus N - 1 coded variables, and the contrast coding determines how the two are related. With treatment coding the intercept is the first group's (covariate-adjusted) mean and the other coefficients are differences from it; with sum-to-zero coding the intercept is the unweighted average of the group means and the coefficients are the deviations of the first N - 1 groups from that average.

## treatment coding
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1

## sum-to-zero coding
contr.sum(3)
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1
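
To see how a coding enters the design matrix, we can look at a few rows of the model matrix under each scheme (a sketch):

head(model.matrix(~ f + x, dat, contrasts.arg = list(f = "contr.treatment")))
head(model.matrix(~ f + x, dat, contrasts.arg = list(f = "contr.sum")))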

This is also why the summary table does not display all factor levels: the design matrix contains only the intercept and the N - 1 coded variables, so one level's effect is always left implicit.

fit.treatment <- lm(y ~ f + x, dat, contrasts = list(f = "contr.treatment"))
coef(fit.treatment)
#(Intercept)         fG2         fG3           x 
#  128.94609   -48.71525   -69.03433     1.03121 

summary(fit.treatment)
anova(fit.treatment)

fit.sum <- lm(y ~ f + x, dat, contrasts = list(f = "contr.sum"))
coef(fit.sum)
#(Intercept)          f1          f2           x 
#  89.696234   39.249860   -9.465391    1.031210 

summary(fit.sum)
anova(fit.sum)
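
Under sum coding the third group's deviation is not shown, but it is implied by the sum-to-zero constraint and can be recovered from the two reported coefficients (a small check):

## deviation of G3 from the grand mean, implied by f1 + f2 + f3 = 0
-(coef(fit.sum)[["f1"]] + coef(fit.sum)[["f2"]])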

Note that although different contrast codings give different regression coefficients, the models are equivalent: they yield the same fitted values.

all.equal(fit.treatment$fitted.values, fit.sum$fitted.values)
## [1] TRUE

Comparing Group Means with the Grand Mean

For each group $j$ we want to test whether its (covariate-adjusted) mean equals the grand mean, i.e. $H_0: \mu_j - \tfrac{1}{3}(\mu_1 + \mu_2 + \mu_3) = 0$. Each such hypothesis is a linear combination of the fitted regression coefficients, with weights that depend on which contrast coding was used.

Linear Hypothesis Tests in R

1. Vanilla Approach Without Fancy Packages

Write $\alpha$ for the intercept, $\beta_2, \beta_3$ for the treatment-coded coefficients and $\gamma_1, \gamma_2$ for the sum-coded ones. Under treatment coding the adjusted group means are $\alpha$, $\alpha + \beta_2$ and $\alpha + \beta_3$, so the grand mean is $\alpha + (\beta_2 + \beta_3)/3$ and the deviations are $-(\beta_2 + \beta_3)/3$, $(2\beta_2 - \beta_3)/3$ and $(2\beta_3 - \beta_2)/3$; under sum coding the deviations are simply $\gamma_1$, $\gamma_2$ and $-(\gamma_1 + \gamma_2)$. Each deviation is a linear combination $w^\top \hat\beta$ with standard error $\sqrt{w^\top \widehat{\mathrm{Var}}(\hat\beta)\, w}$, which is what the code below computes.

## 3 x 2 linear combination matrices, one per coding (rows = groups, columns = coded coefficients)
wt.treatment <- matrix(c(-1, 2, -1, -1, -1, 2), nrow = 3) / 3
wt.sum <- matrix(c(1, 0, -1, 0, 1, -1), nrow = 3)

vanilla <- function (wt, beta.ind, lmfit) {
  ## beta coefficients and their covariance matrix
  beta <- coef(lmfit)[beta.ind]
  V <- vcov(lmfit)[beta.ind, beta.ind]
  ## point estimates of the linear combinations
  MEAN <- c(wt %*% beta)
  ## their standard errors: sqrt of the diagonal of wt %*% V %*% t(wt)
  SE <- sqrt(diag(wt %*% tcrossprod(V, wt)))
  ## perform t-test
  tscore <- MEAN / SE
  pvalue <- 2 * pt(abs(tscore), lmfit$df.residual, lower.tail = FALSE)
  ## return a matrix
  ans <- matrix(c(MEAN, SE, tscore, pvalue), ncol = 4L)
  colnames(ans) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  printCoefmat(ans)
}

vanilla(wt.treatment, 2:3, fit.treatment)
##       Estimate Std. Error t value  Pr(>|t|)    
## [1,]  39.24986    0.90825  43.215 < 2.2e-16 ***
## [2,]  -9.46539    0.89404 -10.587 < 2.2e-16 ***
## [3,] -29.78447    0.32466 -91.741 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

vanilla(wt.sum, 2:3, fit.sum)  ## identical to above

2. Using Package 'multcomp'

multcomp

## pad columns of zeros so the intercept and the slope of x get zero weight
wt.treatment0 <- cbind(0, wt.treatment, 0)
wt.sum0 <- cbind(0, wt.sum, 0)

summary(glht(fit.treatment, linfct = wt.treatment0))
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)    
## 1 == 0  39.2499     0.9083   43.22   <2e-16 ***
## 2 == 0  -9.4654     0.8940  -10.59   <2e-16 ***
## 3 == 0 -29.7845     0.3247  -91.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

summary(glht(fit.sum, linfct = wt.sum0))  ## identical to above
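
By default glht() reports single-step adjusted p-values; to match the unadjusted t-tests from the vanilla approach, the adjustment can be switched off (a sketch using multcomp's adjusted()):

summary(glht(fit.treatment, linfct = wt.treatment0), test = adjusted("none"))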

3. Using Package ‘emmeans’

emmeans

emmeans(fit.treatment, specs = eff ~ f)
## $emmeans
##  f  emmean    SE  df lower.CL upper.CL
##  G1  127.3 1.002 371    125.4    129.3
##  G2   78.6 0.874 371     76.9     80.3
##  G3   58.3 0.379 371     57.5     59.0
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast  estimate    SE  df t.ratio p.value
##  G1 effect    39.25 0.908 371  43.215  <.0001
##  G2 effect    -9.47 0.894 371 -10.587  <.0001
##  G3 effect   -29.78 0.325 371 -91.741  <.0001
## 
## P value adjustment: fdr method for 3 tests

emmeans(fit.sum, specs = eff ~ f)  ## identical to above

Here, with specs = ~ f or specs = "f", only the $emmeans (marginal means) component would be reported. The "eff" on the left-hand side means that eff.emmc() is called to apply effect-style contrasts, and the $contrasts component presents that result.
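
The same result can be obtained in two explicit steps, first computing the marginal means and then applying the effect-style contrasts (a sketch):

emm <- emmeans(fit.treatment, specs = ~ f)   # covariate-adjusted group means
contrast(emm, method = "eff")                # deviations from their average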

4. Using Package ‘car’

The linearHypothesis() function from car performs an F-test of whether all the linear combinations are simultaneously 0, so it is different from the per-contrast t-tests demonstrated above. Moreover, because the three contrasts sum to zero they are linearly dependent, the hypothesis matrix is rank-deficient, and the call fails:

linearHypothesis(fit.treatment, hypothesis.matrix = wt.treatment0)
#Error in solve.default(vcov.hyp) : 
#  system is computationally singular: reciprocal condition number = 1.12154e-17

linearHypothesis(fit.sum, hypothesis.matrix = wt.sum0)  ## fails with the same error
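
Because only two of the three contrasts are linearly independent, dropping one row gives a usable joint test; testing that any two deviations are zero is equivalent to testing that all group means equal the grand mean (a sketch):

linearHypothesis(fit.treatment, hypothesis.matrix = wt.treatment0[1:2, ])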

This answer is, in some sense, a summary of several of my previous answers.

Zheyuan Li