I'm trying to keep "empty space" for multi-level grouped boxplots.

library(ggplot2)

set.seed(42)
n <- 100
dat <- data.frame(x=runif(n),
                  cat1=sample(letters[1:4], size=n, replace=TRUE),
                  cat2=sample(LETTERS[1:3], size=n, replace=TRUE))
ggplot(dat, aes(cat1, x)) + geom_boxplot(aes(fill=cat2))

[Image: grouped boxplots of x by cat1, filled by cat2]

If I force one of the groups to be empty:

dat <- subset(dat, ! (cat1 == 'b' & cat2 == 'B'))
table(dat$cat1, dat$cat2)
##    
##      A  B  C
##   a  9  9  7
##   b  8  0  5
##   c 13 11  6
##   d 11 10  5
ggplot(dat, aes(cat1, x)) + geom_boxplot(aes(fill=cat2))

[Image: the same plot with the b/B box missing and the remaining "b" boxes widened]

The second group, "b", is now expanded to fill the space. What I'd like is:

[Image: the desired plot, with an empty slot left where the b/B box would be]
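One way to confirm why the gap closes is to look at what ggplot2 actually computes for the layer. The sketch below is only an illustration, assuming a reasonably recent ggplot2 in which `layer_data()` is available; the names `p`, `boxes`, and `n_combos` are introduced here and are not part of the original post. It counts the computed boxes and compares that with the number of non-empty cat1/cat2 combinations.

```r
library(ggplot2)

set.seed(42)
n <- 100
dat <- data.frame(x = runif(n),
                  cat1 = sample(letters[1:4], size = n, replace = TRUE),
                  cat2 = sample(LETTERS[1:3], size = n, replace = TRUE))

# Force the b/B combination to be empty, as in the question
dat <- subset(dat, !(cat1 == "b" & cat2 == "B"))

p <- ggplot(dat, aes(cat1, x)) + geom_boxplot(aes(fill = cat2))

# layer_data() returns the data computed for a layer: one row per box
boxes <- layer_data(p, 1)

# Number of cat1/cat2 combinations that still have observations
n_combos <- sum(table(dat$cat1, dat$cat2) > 0)

nrow(boxes)  # boxes actually computed
n_combos     # non-empty combinations

# Horizontal extent of each box after dodging
boxes[, c("x", "xmin", "xmax")]
```

No box is computed for the empty combination, so the dodging step only has the remaining boxes to distribute within "b".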

SO 9818835 (forcing an empty level to appear) works fine on the top level, but I can't figure out how to get it to work for a second level of categories. In scale_x_discrete(...), I tried setting:

  • breaks=letters[1:4]
  • breaks=LETTERS[1:3]
  • breaks=list(letters[1:4], LETTERS[1:3]) (a stab)
  • breaks=NULL
  • breaks=func where func <- function(x, ...) { browser(); 1; } in order to troubleshoot; it only offered letters[1:4] and never prompted for the second level

Using interaction(letters[1:4], LETTERS[1:3]) still does not leave empty space. I also tried a workaround: injecting an out-of-bounds x value and forcing it off the screen with scale_y_continuous(limits), but ggplot2 is too smart for me and closes the gap again.

Are there elegant (i.e., "correct" in ggplot2 mechanisms) solutions?

r2evans
  • How elegant does it need to be? Just setting `x` to zero for these records seems to create something that looks quite reasonable. `dat <- dat %>% mutate(x = ifelse(cat1 == 'b' & cat2 == 'B', 0, x))` – akhmed Oct 21 '15 at 22:12
  • That's elegant programmatically (and I had already tried it, sans the `scale_y_continuous(limits)` step), but I'm a little OCD when it comes to my visualizations: I'll always stare at the distracting line at the bottom of the plot. – r2evans Oct 21 '15 at 22:16
  • Plus there's the statistical difference between "a line means no data" and "a line indicates a single data point of value 0". – r2evans Oct 21 '15 at 22:18
  • I see. So the main issue is the extra line. Why not use `coord_cartesian` then instead of `scale_y_continuous`? – akhmed Oct 21 '15 at 22:24

1 Answer

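Could coord_cartesian be the solution you are looking for?

It only zooms in on the plot and will not try to "outsmart" the data the way scale_y_continuous(limits) does: scale limits drop out-of-range values before the boxplot statistics are computed, whereas coord_cartesian changes only the visible region.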

library(dplyr)
library(ggplot2)

set.seed(42)
n <- 100
dat <- data.frame(x=runif(n),
                  cat1=sample(letters[1:4], size=n, replace=TRUE),
                  cat2=sample(LETTERS[1:3], size=n, replace=TRUE))

# Any value outside the visible y range works; it only reserves the slot for b/B
LARGE_VALUE <- 2

dat <- dat %>%
  mutate(x = ifelse(cat1 == 'b' & cat2 == 'B', 
                    LARGE_VALUE,
                    x))

ggplot(dat, aes(cat1, x)) + 
  geom_boxplot(aes(fill=cat2)) + 
  coord_cartesian(ylim = c(0,1))

[Image: the resulting plot, with the b/B slot left empty]
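To see the difference between the two approaches without comparing plots by eye, the number of boxes each variant computes can be compared. This is a rough sketch rather than part of the original answer, assuming a ggplot2 version that provides `layer_data()`; `out_of_view`, `base`, `n_scale`, and `n_coord` are names made up for the illustration.

```r
library(ggplot2)

set.seed(42)
n <- 100
dat <- data.frame(x = runif(n),
                  cat1 = sample(letters[1:4], size = n, replace = TRUE),
                  cat2 = sample(LETTERS[1:3], size = n, replace = TRUE))

# Push the b/B group outside the visible range, as in the answer
out_of_view <- with(dat, cat1 == "b" & cat2 == "B")
dat$x[out_of_view] <- 2

base <- ggplot(dat, aes(cat1, x)) + geom_boxplot(aes(fill = cat2))

# Scale limits: values outside c(0, 1) are removed (with a warning)
# before the boxplot statistics are computed
n_scale <- nrow(suppressWarnings(
  layer_data(base + scale_y_continuous(limits = c(0, 1)), 1)
))

# Coord limits: the statistics see all values; only the view is zoomed
n_coord <- nrow(layer_data(base + coord_cartesian(ylim = c(0, 1)), 1))

n_scale
n_coord
```

With coord_cartesian the b/B box is still computed, so it keeps its slot when the boxes are dodged, even though it lies outside the visible region.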

akhmed
  • Yup, that does what I need. I'm a little surprised that `drop=FALSE` doesn't do what I think it would do, but then again I need to work more to fully grok `ggplot2`. Thanks. – r2evans Oct 21 '15 at 22:30