How to interpret R linear regression when there are multiple factor levels as the baseline?

Question

My data has 3 independent variables, all of which are categorical:

condition: cond1, cond2, cond3

population: A,B,C

task: 1,2,3,4,5

The dependent variable is the task completion time. I run lm(time~condition+user+task,data) in R and get the following results:

[Image: R regression output for the model above]

What confuses me is that cond1, groupA, and task1 are left out of the results. From the thread "linear regression "NA" estimate just for last coefficient", I understand that one factor level is chosen as the "baseline" and shown in the (Intercept) row.

But what if there are multiple factor levels used as the baseline, as in the above case?

Does the (Intercept) row now indicate cond1+groupA+task1?
What if I want to know the coefficient and significance for cond1, groupA, and task1 individually?
For example, groupB has an estimated coefficient +9.3349, compared to groupA? Or compared to cond1+groupA+task1?

This is more likely related to Statistics, try http://stats.stackexchange.com/ — zx8754, Feb 10 '14 at 12:42

score 5 · Accepted Answer · answered Feb 10 '14 at 13:00


Each person in your population must have one value for each of the variables 'condition', 'population' and 'task', so the baseline individual must have a value for each of these variables; in this case, cond1, A and t1. All of the results are expressed relative to the ideal (mean) individual with these values of the independent variables, so the intercept does give the mean value of time for cond1, groupA and task1.

Asking for the significance or coefficient of cond1, groupA or task1 makes no sense, as significance here means a significantly different mean value between one group and the reference group. You cannot compare the reference group against itself.

As your model has no interactions, the coefficient for groupB means that the mean time for somebody in population B will be 9.33 (seconds?) higher than the time for somebody in population A, regardless of the condition and task they are performing. As the p-value is very small, you can conclude that the mean time does in fact differ between people in population B and people in the reference population (A). If you added an interaction term to the model, these terms (for example usergroupB:taskt4) would indicate the extra value added to (or subtracted from) the mean time if an individual has both conditions (in this example, if an individual is from population B and has performed task 4). These effects would be added to the marginal ones (usergroupB and taskt4).
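To make the "regardless of the condition and task" point concrete, here is a minimal sketch on simulated data. The data frame `d`, the level names and every number in it are invented for illustration and are not the asker's data; only the model formula follows the question.

```r
# Simulated, balanced data; all values are made up for illustration.
set.seed(1)
d <- expand.grid(
  condition = c("cond1", "cond2", "cond3"),
  user      = c("groupA", "groupB", "groupC"),
  task      = c("t1", "t2", "t3", "t4", "t5"),
  rep       = 1:4
)
d$time <- 20 + 9 * (d$user == "groupB") - 14 * (d$task == "t4") +
  rnorm(nrow(d), sd = 2)

# Additive model, as in the question (no interaction terms).
fit <- lm(time ~ condition + user + task, data = d)

# Predict groups A and B under two different condition/task combinations.
newd <- data.frame(
  condition = factor(c("cond1", "cond1", "cond3", "cond3"), levels = levels(d$condition)),
  user      = factor(c("groupA", "groupB", "groupA", "groupB"), levels = levels(d$user)),
  task      = factor(c("t1", "t1", "t4", "t4"), levels = levels(d$task))
)
p <- predict(fit, newdata = newd)

diff_base  <- unname(p[2] - p[1])             # B minus A at cond1 / t1
diff_other <- unname(p[4] - p[3])             # B minus A at cond3 / t4
b_coef     <- unname(coef(fit)["usergroupB"]) # the groupB coefficient
print(c(diff_base, diff_other, b_coef))       # all three agree

# With an interaction, an extra term such as usergroupB:taskt4 appears.
fit_int <- lm(time ~ condition + user * task, data = d)
print(coef(fit_int)["usergroupB:taskt4"])
```

In the additive fit the B-minus-A difference is the same at every condition/task combination and equals the `usergroupB` coefficient; once `user * task` is in the model, that is no longer guaranteed.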

Hope I helped.


Rufo


I don't know why this got a downvote. So, I gave it an upvote. – Roland Feb 10 '14 at 13:32
“B is 9.33 higher than A, regardless of the condition and task they are performing”. This seems to contradict the other answers so far, which suggest that B is higher than A under condition1 and task1? – Ida Feb 10 '14 at 14:11
@Ida: B is 9.33 time units higher than A under any condition and task, as it is an overall effect. I'm sorry, but the other answers may be a little misleading in this respect. – Rufo Feb 10 '14 at 14:48
@Roland: Thanks for the upvote :) A comment about your answer (thanks to Ida). You say `It's the difference between cond1/task1/groupA and cond1/task1/groupB.` It may also be the difference between `cond3/task4/groupA` and `cond3/task4/groupB` (the other covariates are the same, but not necessarily the baseline ones). – Rufo Feb 10 '14 at 14:53
Thanks. I've tried to clarify in my answer. But of course the OP should have a look at appropriate textbooks. – Roland Feb 10 '14 at 15:24
@Rufo Maybe I am wrong. Could you please explain why it is an overall effect. I suppose the effect holds for the baseline levels only. Thanks. – Sven Hohenstein Feb 10 '14 at 19:00

@SvenHohenstein: A practical case. Let's predict the mean Y (time) for two people with covariates a) c1/t1/gA and b) c1/t1/gB, and for two people with c) c3/t4/gA and d) c3/t4/gB. a) E[Y] = 16.59 (Intercept only); b) E[Y] = 16.59 + 9.33 (Intercept + groupB); c) E[Y] = 16.59 - 0.27 - 14.61 (Intercept + cond3 + task4); d) E[Y] = 16.59 - 0.27 - 14.61 + 9.33 (Intercept + cond3 + task4 + groupB). The mean difference between a) and b) is the groupB term, 9.33 seconds. The mean difference between c) and d) is also the groupB term, 9.33 seconds. A main-effect term is always the added effect of that term, holding the other covariates fixed. – Rufo Feb 11 '14 at 11:14
@Rufo Thanks. OK, now I see your point. :) – Sven Hohenstein Feb 11 '14 at 16:01
+1 for mentioning interaction terms. Still, I wonder if for an interaction term groupB:task4, I got a +0.3 coefficient and a p-value<0.05, should I explain it as groupB+task4 increase the time a) compared to groupA+task1? b) compared to groupA+task1+cond1? or c) as an overall effect? – Ida Feb 12 '14 at 09:31
It's the added effect of the combination of the two covariates groupB and task4. People from group B are 9.3 seconds slower than people from group A, and people performing task 4 are 14.6 seconds faster than people performing task 1, except (because of the interaction term) people in group B, who are only 14.3 (14.6 - 0.3) seconds faster. It's like saying group B is especially slow in task 4. The comparison is in fact against what groupB+task4 would be if the task did not affect the groups differently. BTW, the coefficient is just +0.3 and the p-value is significant? – Rufo Feb 12 '14 at 09:44
So, is it proper to conclude that, in general, group B is especially slow for task 4? (The coefficient +0.3 is just something I made up as an example.) – Ida Feb 12 '14 at 10:57
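The arithmetic from the comments above can be checked in a few lines. This is only a sketch: the four coefficients are the rounded values quoted in the comments, the +0.3 interaction is the made-up value from the comments, and the variable names are chosen here for illustration.

```r
# Rounded coefficients as quoted in the comment thread.
b <- c(intercept = 16.59, groupB = 9.33, cond3 = -0.27, task4 = -14.61)

# Predicted mean times for the four example cases.
mean_a <- b[["intercept"]]                                # cond1 / task1 / groupA
mean_b <- b[["intercept"]] + b[["groupB"]]                # cond1 / task1 / groupB
mean_c <- b[["intercept"]] + b[["cond3"]] + b[["task4"]]  # cond3 / task4 / groupA
mean_d <- mean_c + b[["groupB"]]                          # cond3 / task4 / groupB

print(round(c(a = mean_a, b = mean_b, c = mean_c, d = mean_d), 2))

# The B-minus-A difference is the groupB coefficient in both comparisons.
print(round(c(mean_b - mean_a, mean_d - mean_c), 2))

# Hypothetical interaction groupB:task4 of +0.3 (made-up value).
interaction_b4      <- 0.3
task4_effect_groupA <- b[["task4"]]                   # task 4 vs task 1 in group A
task4_effect_groupB <- b[["task4"]] + interaction_b4  # task 4 vs task 1 in group B
print(c(task4_effect_groupA, task4_effect_groupB))
```

The four means come out as 16.59, 25.92, 1.71 and 11.04, so both differences are 9.33. With the hypothetical interaction, the task 4 effect for group B is -14.31 rather than -14.61.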

Roland · Answer 2 · 2014-02-10T15:23:01.893

Does the (Intercept) row now indicate cond1+groupA+task1?

Yes.

What if I want to know the coefficient and significance for cond1, groupA, and task1 individually?

Think about what significance means. You need to formulate a hypothesis. In your example everything is compared to the intercept and your question doesn't really make sense. However, you can always conduct pairwise comparisons between all possible effect combinations (see package multcomp).
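As a rough sketch of what such comparisons can look like: one base-R option is to refit the model with a different reference level via `relevel()`, which gives a coefficient for groupA relative to groupB. The data below are simulated and all names and numbers are assumptions for illustration. The `multcomp` part runs only if that package is installed.

```r
# Simulated data; values are invented for illustration only.
set.seed(2)
d <- expand.grid(
  condition = c("cond1", "cond2", "cond3"),
  user      = c("groupA", "groupB", "groupC"),
  task      = c("t1", "t2", "t3", "t4", "t5"),
  rep       = 1:4
)
d$time <- 20 + 9 * (d$user == "groupB") + rnorm(nrow(d), sd = 2)

fit <- lm(time ~ condition + user + task, data = d)

# Refit with groupB as the reference level of 'user'.
d2 <- d
d2$user <- relevel(d2$user, ref = "groupB")
fit_b <- lm(time ~ condition + user + task, data = d2)

# groupA now has its own row; it mirrors the original groupB coefficient.
b_vs_a <- unname(coef(fit)["usergroupB"])
a_vs_b <- unname(coef(fit_b)["usergroupA"])
print(c(b_vs_a, a_vs_b))

# Optional: all pairwise comparisons between the levels of 'user'.
if (requireNamespace("multcomp", quietly = TRUE)) {
  pairwise <- multcomp::glht(fit, linfct = multcomp::mcp(user = "Tukey"))
  print(summary(pairwise))
}
```

Changing the reference level does not change the fitted model, only which differences the coefficient table reports.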

For example, groupB has an estimated coefficient +9.3349, compared to groupA? Or compared to cond1+groupA+task1?

It's the difference between cond1/task1/groupA and cond1/task1/groupB. (As @Rufo correctly points out, it is of course an overall effect and actually the difference between groupB and groupA provided the other effects are equal.)

Sven Hohenstein · Answer 3 · 2014-02-10T18:30:02.790

By default, R uses treatment contrasts for categorical variables. Hence, the first level is treated as the base level. All remaining levels are compared with the base level.

Your base levels are cond1 for condition, A for population, and 1 for task. All coefficients are estimated in relation to these base levels.
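The base level and the dummy coding can be inspected directly. A small sketch, using a toy factor whose level names are copied from the question (the vector itself is made up):

```r
# Toy factor; levels are sorted alphabetically, so cond1 comes first.
condition <- factor(c("cond1", "cond2", "cond3", "cond2"))

print(levels(condition))        # the first level is the base level

# Treatment (dummy) coding: the base level's row is all zeros.
cm <- contrasts(condition)
print(cm)

# Contrast settings currently in effect for unordered and ordered factors.
print(getOption("contrasts"))

# The design matrix has one column per non-base level plus the intercept.
mm <- model.matrix(~ condition)
print(colnames(mm))
```

Because cond1 has no column of its own, it gets no row in the coefficient table; its mean is absorbed into the intercept.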

The intercept is the estimated mean of the response variable when all three factors are at their base levels.

For example, the effect conditioncond2 is the difference between cond2 and cond1 where population is A and task is 1. Hence, the coefficients do not tell you anything about an overall difference between conditions, but in the data related to the base levels only. (Analogously, conditioncond3 is the difference between cond3 and cond1.)

The same is true for the other factors. The effects of population hold for condition cond1 and task 1 only. The effects of task hold for condition cond1 and population A only.
