
Just trying to understand how geom_abline works with facets in ggplot.

I have a dataset of student test scores. These are in a data table dt with 4 columns:

student: unique student ID
cohort:  grouping factor for students (A, B, … H)
subject: subject of the test (English, Math, Science)
score:   the test score for that student in that subject

The goal is to compare cohorts. The following snippet creates a sample dataset.

library(data.table)
## cohorts: list of cohorts with number of students in each
cohorts <- data.table(name=toupper(letters[1:8]),size=as.numeric(c(8,25,16,30,10,27,13,32)))
## base: assign students to cohorts
base    <- data.table(student=c(1:sum(cohorts$size)),cohort=rep(cohorts$name,cohorts$size))
## scores for each subject
english <- data.table(base,subject="English", score=rnorm(nrow(base), mean=45, sd=50))
math    <- data.table(base,subject="Math",    score=rnorm(nrow(base), mean=55, sd=25))
science <- data.table(base,subject="Science", score=rnorm(nrow(base), mean=70, sd=25))
## combine
dt      <- rbind(english,math,science)
## clip scores to [0,100]
dt$score<- pmin(pmax(dt$score, 0), 100)

The following displays the mean score by cohort with 95% confidence limits, faceted by subject, and includes a (blue, dashed) reference line (using geom_abline).

library(ggplot2)
library(Hmisc)
ggp <- ggplot(dt,aes(x=cohort, y=score)) + ylim(0,100)
ggp <- ggp + stat_summary(fun.data="mean_cl_normal")
ggp <- ggp + geom_abline(aes(slope=0,intercept=mean(score)),color="blue",linetype="dashed")
ggp <- ggp + facet_grid(subject~.)
ggp

The problem is that the reference line (from geom_abline) is the same in all facets (= the grand average score for all students and all subjects). So stat_summary seems to respect the grouping implied in facet_grid (e.g., by subject), but abline does not. Can anyone explain why?

NB: I realize this problem can be solved by creating a separate table of group means and using that as the data source in geom_abline (below), but why is this necessary?

means <- dt[,list(mean.score=mean(score)),by="subject"]
ggp <- ggplot(dt,aes(x=cohort, y=score)) + ylim(0,100)
ggp <- ggp + stat_summary(fun.data="mean_cl_normal")
ggp <- ggp + geom_abline(data=means, aes(slope=0,intercept=mean.score),color="blue",linetype="dashed")
ggp <- ggp + facet_grid(subject~.)
ggp
jlhoward
  • I don't know the answer, but I think your problem has something to do with using `mean` inside `aes`. You are summarizing many y-values into a single value, but I think that the `geom_*` functions don't work like this. Try replacing your `geom_abline` call with `geom_point(aes(y=mean(score)),color="blue")` and compare it with `geom_point(aes(y=score),color="blue")`. This may help in your debugging process. You may also want to look at the last example in the `geom_hline` documentation. – kdauria Nov 14 '13 at 19:15

2 Answers


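This should do what you want. The stat_* functions compute their summaries separately on the subset of data in each facet. Expressions inside aes(), by contrast, are (I think) evaluated once on the whole data set before it is split into facets, so mean(score) there is the grand mean. They are intended for row-wise transformations of each y-value, not for per-facet summaries.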

ggplot(dt,aes(x=cohort, y=score)) +
       stat_summary(fun.data="mean_cl_normal") + 
       stat_smooth(formula=y~1,aes(group=1),method="lm",se=FALSE) +
       facet_grid(subject~.) + ylim(0,100)
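The difference between the two evaluation scopes can be checked numerically outside of ggplot. This is only a sketch: the toy table and its values are made up for illustration (they are not the simulated scores from the question), and it assumes that aes() expressions see the full score column at once, while a stat sees one subset per facet.

```r
library(data.table)

# Toy data (hypothetical values, chosen so the means are easy to read)
toy <- data.table(
  subject = rep(c("English", "Math", "Science"), each = 4),
  score   = c(40, 50, 60, 50,  55, 65, 45, 55,  70, 80, 60, 70)
)

# Roughly what aes(intercept = mean(score)) computes: one value
# from the whole column, regardless of subject
grand_mean <- toy[, mean(score)]

# Roughly what a per-facet statistic computes: one value per subject
subject_means <- toy[, .(mean_score = mean(score)), by = subject]

print(grand_mean)
print(subject_means)
```

The single grand mean is what ends up as the same reference line in every facet, whereas the per-subject table has one row (and so one line) per facet.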


kdauria
  • This is a huge improvement over what I was doing. Setting se=TRUE (the default) in stat_smooth(...) displays not only the group (subject) mean, but also the 95% confidence limits for each group. This allows comparison of the cohorts not only to each other, but also to the mean for the subject, something that cannot be done with geom_abline. Use of formula=y~1 is very clever... – jlhoward Nov 14 '13 at 21:57

As golbasche mentioned, I would have probably done something more like this:

## add the per-subject mean as a column; := modifies dt by reference, so no reassignment is needed
dt[,avg_score := mean(score),by = subject]

ggplot(dt,aes(x=cohort, y=score)) + 
    facet_grid(subject~.) + 
    stat_summary(fun.data="mean_cl_normal") +
    geom_hline(aes(yintercept = avg_score),color = "blue",linetype = "dashed") + 
    ylim(0,100)
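As a small illustration of why this works, the grouped := assignment gives every row the mean of its own subject, so each facet only ever receives a single yintercept value. The toy table below is hypothetical and only meant to show the shape of the result.

```r
library(data.table)

# Toy data (hypothetical values)
toy <- data.table(
  subject = c("English", "English", "Math", "Math"),
  score   = c(40, 60, 50, 70)
)

# Grouped assignment by reference: each row receives its subject's mean
toy[, avg_score := mean(score), by = subject]

print(toy)
```

Note that geom_hline then draws one line per row, all at the same height within a facet. If the overplotting matters, passing a de-duplicated table such as unique(dt[, .(subject, avg_score)]) as the layer's data should give one line per facet.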
joran