1

I've just started working with R and am trying to find out how to add mean and median labels to a box plot using ggplot.
I have a dataset with the columns Unit, Quarter, Days (number of days) and Z:

dset <- read.table(text='Unit     Quarter  Days   Z  
HH       1Q      25  Y      
PA       1Q      28  N     
PA       1Q      10  Y     
HH       1Q      53  Y
HH       1Q      12  Y
HH       1Q      20  Y
HH       1Q      43  N
PA       1Q      11  Y
PA       1Q      66  Y
PA       1Q      54  Y      
PA       2Q      19  N
PA       2Q      46  Y
PA       2Q      37  Y
HH       2Q      22  Y      
HH       2Q      67  Y      
PA       2Q      45  Y
HH       2Q      48  Y
HH       2Q      15  N
PA       3Q      12  Y               
PA       3Q      53  Y      
HH       3Q      58  Y
HH       3Q      41  N
HH       3Q      18  Y
PA       3Q      26  Y
PA       3Q      12  Y
HH       3Q      63  Y
                   ', header=TRUE)

I need to show data by Unit and Quarter and create a boxplot displaying mean and median values.
My code for a boxplot:

ggplot(data = dset, aes(x = Quarter, y = Days, fill = Quarter)) +
  geom_boxplot(outlier.shape = NA) +
  facet_grid(. ~ Unit) + # adding another dimension
  coord_cartesian(ylim = c(10, 60)) + # sets the y-axis limits
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red", fill = "red") + # adds average dot
  geom_text(data = means, aes(label = round(Days, 1), y = Days + 1), size = 3) + # adds average labels
  geom_text(data = medians, aes(label = round(Days, 1), y = Days - 0.5), size = 3) + # adds median labels
  xlab(" ") +
  ylab("Days") +
  ggtitle("Days") +
  theme(legend.position = 'none')

I can use the geom_text function to add mean and median labels, but only for one dimension ("Quarter"), and it requires calculating the means and medians beforehand:

means <- aggregate(Days ~  Quarter, dset, mean)
medians <- aggregate(Days ~  Quarter, dset, median)

That works pretty well, and I also managed to calculate the mean and median values by both "Unit" and "Quarter":

means <- aggregate(dset[, 'Days'], list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), mean)
medians <- aggregate(dset[, 'Days'], list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), median)

but I do not know how to pass those variables to the geom_text function to display labels for the mean and median. Maybe I should calculate the mean and median in a different way, or maybe there are other ways to add those labels.
Would be grateful for any suggestions!

Tart
  • Have a look at this https://stackoverflow.com/questions/19876505/boxplot-show-the-value-of-mean – prosoitos Oct 29 '18 at 22:55
  • @prosoitos I've read that post before but it doesn't tell you how to get labels if you use two dimensions, in my case "Unit" and "Quarter". – Tart Oct 30 '18 at 13:54
  • Also nothing about adding labels to both median and mean – Tart Oct 30 '18 at 14:11
  • Oh, sorry. I thought it would be useful – prosoitos Oct 30 '18 at 15:04
  • Could you help with another related question? What should I change in `means <- aggregate(dset[, 'Days'], list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), mean)` if I want to find the mean for a subset of the data using another column "Z"? – Tart Oct 31 '18 at 14:46
  • I tried `means <- aggregate(subset(dset[, 'Days', 'Z'], Z=="Y"), list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), mean)` but it doesn't work. – Tart Oct 31 '18 at 14:46

2 Answers

6

It looks like the problem is that when you calculate the mean and median values by both "Unit" and "Quarter" this way, the column that used to be called "Days" is now called "x" (aggregate() uses that default name when it is passed an unnamed vector). So simply update your geom_text calls to reflect this.

ggplot(data = dset, aes(x = Quarter, y = Days, fill = Quarter)) +
  geom_boxplot(outlier.shape = NA) +
  facet_grid(. ~ Unit) + # adding another dimension
  coord_cartesian(ylim = c(10, 60)) + # sets the y-axis limits
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red", fill = "red") + # adds average dot
  geom_text(data = means, aes(label = round(x, 1), y = x + 1), size = 3) + # adds average labels
  geom_text(data = medians, aes(label = round(x, 1), y = x - 0.5), size = 3) + # adds median labels
  xlab(" ") +
  ylab("Days") +
  ggtitle("Days") +
  theme(legend.position = 'none')
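
As a possible alternative, the formula interface of aggregate() keeps the original column name, so the geom_text layers from the question could stay as they were written. The sketch below assumes a reduced, made-up subset of the question's rows (only six observations) and only draws the plot if ggplot2 is installed; it is an illustration, not the only way to do this.

```r
# Reduced illustrative copy of the question's data (assumption: six rows only)
dset <- data.frame(
  Unit    = c("HH", "HH", "PA", "PA", "HH", "PA"),
  Quarter = c("1Q", "1Q", "1Q", "1Q", "2Q", "2Q"),
  Days    = c(25, 53, 28, 10, 22, 19)
)

# Formula interface: the summarised column keeps the name "Days"
means   <- aggregate(Days ~ Unit + Quarter, data = dset, FUN = mean)
medians <- aggregate(Days ~ Unit + Quarter, data = dset, FUN = median)
names(means)

# Plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dset, aes(x = Quarter, y = Days, fill = Quarter)) +
    geom_boxplot(outlier.shape = NA) +
    facet_grid(. ~ Unit) +
    geom_text(data = means, aes(label = round(Days, 1), y = Days + 1), size = 3) +
    geom_text(data = medians, aes(label = round(Days, 1), y = Days - 0.5), size = 3) +
    theme(legend.position = "none")
  print(p)
}
```

Because the summary tables contain both Unit and Quarter, each label is placed in the matching facet and x position.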
Sarah
  • That's awesome! Thank you Sarah! – Tart Oct 30 '18 at 14:22
  • could you help with another corresponding question? What should I change `means <- aggregate(dset[, 'Days'], list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), mean)` if I want to find a mean for a subset of the data using another column "Z"? – Tart Oct 31 '18 at 14:41
  • I tried `means <- aggregate(subset(dset[, 'Days', 'Z'], Z=="Y"), list('Unit' = dset$Unit, 'Quarter' = dset$Quarter), mean)` but it doesn't work... – Tart Oct 31 '18 at 14:44
  • I think this is what you're looking for: – Sarah Oct 31 '18 at 19:08
1

In answer to your second question, I think you are looking for something like this. This code produces the same chart, but restricted to the subsample where Z == "Y".

means <- aggregate(dset[, 'Days'][dset$Z=="Y"], list('Unit' = dset$Unit[dset$Z=="Y"], 'Quarter' = dset$Quarter[dset$Z=="Y"]), mean)
medians <- aggregate(dset[, 'Days'][dset$Z=="Y"], list('Unit' = dset$Unit[dset$Z=="Y"], 'Quarter' = dset$Quarter[dset$Z=="Y"]), median)

ggplot(data = dset[dset$Z=="Y",], aes(x = Quarter, y = Days, fill = Quarter)) +
  geom_boxplot(outlier.shape = NA) +
  facet_grid(. ~ Unit) + # adding another dimension
  coord_cartesian(ylim = c(10, 60)) + # sets the y-axis limits
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red", fill = "red") + # adds average dot
  geom_text(data = means, aes(label = round(x, 1), y = x + 1), size = 3) + # adds average labels
  geom_text(data = medians, aes(label = round(x, 1), y = x - 0.5), size = 3) + # adds median labels
  xlab(" ") +
  ylab("Days") +
  ggtitle("Days") +
  theme(legend.position = 'none')
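
The repeated `[dset$Z=="Y"]` filters above could also be written once. A minimal sketch, assuming a small made-up sample with the same four columns: filter the data first with subset(), then aggregate with the formula interface so the result column is named "Days" rather than "x". The name `sub` is only an illustrative choice.

```r
# Small illustrative sample (assumption: not the full data from the question)
dset <- data.frame(
  Unit    = c("HH", "HH", "PA", "PA", "HH", "PA"),
  Quarter = c("1Q", "1Q", "1Q", "1Q", "2Q", "2Q"),
  Days    = c(25, 43, 28, 10, 22, 19),
  Z       = c("Y", "N", "N", "Y", "Y", "N")
)

# Filter once, then reuse the filtered data for both summaries
sub <- subset(dset, Z == "Y")

means   <- aggregate(Days ~ Unit + Quarter, data = sub, FUN = mean)
medians <- aggregate(Days ~ Unit + Quarter, data = sub, FUN = median)

means
```

With this variant the same `sub` data frame would be passed to ggplot(), and the geom_text layers would refer to `Days` instead of `x`.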
Sarah
  • This also works great, thank you! Though, I noticed that if I have "NA" values in column Z, the boxplot shows 3 graphs: 2 for Units and 1 for "NA". Also I managed to write another option by creating a temp table: `tdf <- subset(dset, Z=="Y", select = c('Days', 'Unit', 'Quarter'))` and then just using 'tdf' table instead of 'dset': `means <- aggregate(tdf[, 'Days'], list('Unit' = tdf$Unit, 'Quarter' = tdf$Quarter), mean)` – Tart Oct 31 '18 at 20:27
  • As for avoiding a graph with "NA" records, I changed this line a bit: `ggplot(data = subset(dset, Z=="Y"), aes(x = Quarter, y = Days, fill = Quarter))...` – Tart Oct 31 '18 at 20:30
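
The extra "NA" panel mentioned in the comments is likely a consequence of how logical indexing treats missing values. The sketch below uses a made-up three-row data frame to show the difference: indexing with `df$Z == "Y"` returns an all-NA row for each NA in Z, while subset() and which() drop those rows.

```r
# Made-up data with a missing value in Z (assumption, for illustration only)
df <- data.frame(
  Days = c(10, 20, 30),
  Z    = c("Y", NA, "N")
)

by_bracket <- df[df$Z == "Y", ]         # NA in the condition yields an all-NA row
by_subset  <- subset(df, Z == "Y")      # subset() treats NA as FALSE
by_which   <- df[which(df$Z == "Y"), ]  # which() keeps only the TRUE positions

nrow(by_bracket)
nrow(by_subset)
nrow(by_which)
```

An all-NA row also has NA in the faceting column, which would explain the third panel.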