I am generating split violin plots using the geom_split_violin function created here: Split violin plot with ggplot2.

Then I add a sample-size label (n = ...) for each half of each split violin. Unfortunately, the two labels at each x position overlap. How can I move them slightly to the left and right so that they no longer overlap?

Here is the full code that I am using and below it the result with overlapping "n = ..." labels.

 # Create data
 set.seed(20160229)
 my_data = data.frame(
     y=c(rnorm(500), rnorm(300, 0.5), rnorm(400, 1), rnorm(200, 1.5)),
     x=c(rep('a', 800), rep('b', 600)),
     m=c(rep('i', 300), rep('j', 700), rep('i', 400)))
 # Code to create geom_split_violin function from link above
 library('ggplot2')
 GeomSplitViolin <- ggproto("GeomSplitViolin", GeomViolin, 
                       draw_group = function(self, data, ..., draw_quantiles = NULL) {
    data <- transform(data, xminv = x - violinwidth * (x - xmin), xmaxv = x + violinwidth * (xmax - x))
   grp <- data[1, "group"]
   newdata <- plyr::arrange(transform(data, x = if (grp %% 2 == 1) xminv else xmaxv), if (grp %% 2 == 1) y else -y)
   newdata <- rbind(newdata[1, ], newdata, newdata[nrow(newdata), ], newdata[1, ])
   newdata[c(1, nrow(newdata) - 1, nrow(newdata)), "x"] <- round(newdata[1, "x"])
   if (length(draw_quantiles) > 0 && !scales::zero_range(range(data$y))) {
     stopifnot(all(draw_quantiles >= 0), all(draw_quantiles <= 1))
     quantiles <- ggplot2:::create_quantile_segment_frame(data, draw_quantiles)
     aesthetics <- data[rep(1, nrow(quantiles)), setdiff(names(data), c("x", "y")), drop = FALSE]
     aesthetics$alpha <- rep(1, nrow(quantiles))
     both <- cbind(quantiles, aesthetics)
     quantile_grob <- GeomPath$draw_panel(both, ...)
     ggplot2:::ggname("geom_split_violin", grid::grobTree(GeomPolygon$draw_panel(newdata, ...), quantile_grob))
   }
   else {
     ggplot2:::ggname("geom_split_violin", GeomPolygon$draw_panel(newdata, ...))
   }
 })
 geom_split_violin <- function(mapping = NULL, data = NULL, stat = "ydensity", position = "identity", ..., 
                               draw_quantiles = NULL, trim = TRUE, scale = "area", na.rm = FALSE, 
                               show.legend = NA, inherit.aes = TRUE) {
   layer(data = data, mapping = mapping, stat = stat, geom = GeomSplitViolin, 
         position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
         params = list(trim = trim, scale = scale, draw_quantiles = draw_quantiles, na.rm = na.rm, ...))
 }
 # Add labels 'n = ...'
 give_n = function(x, y_lo = min(my_data$y)) {
      data.frame(y = y_lo * 1.06,
              label = paste("n =", length(x)))
 }
 # Plot data
 ggplot(my_data, aes(x, y, fill = m)) + 
      geom_split_violin() + 
      stat_summary(fun.data = give_n, aes(x = as.factor(x)), geom = "text")

Result (note the overlapping 'n = ...' labels): [image]

Sylvia Rodriguez

1 Answer

Does adding position_nudge() solve your problem?

ggplot(my_data, aes(x, y, fill = m)) + 
  geom_split_violin() + 
  stat_summary(fun.data = give_n, aes(x = as.factor(x)), geom = "text",
               position = position_nudge(x = c(-0.25, 0.25)))

teunbrand
  • Thank you. This solves the problem for the question I posted. In my real dataset I unfortunately still have a problem: the "n=300" and "n=500" labels are sometimes switched, so they are no longer correct. I understand that you would need the data to help with that issue too. However, may I ask what determines which label gets the -0.25 nudge and which gets the +0.25 nudge? Is it based on the first value in the column, i.e. on whether i or j occurs first for "a"? I can try to edit my question if you cannot answer based on this information alone. Thank you. – Sylvia Rodriguez Aug 12 '19 at 05:39
  • Yes, it is based on the order in which the data occurs during processing. Statistics are usually calculated group-wise, and setting a `fill` is equivalent to setting a group. If this is difficult to predict, you could use `layer_data(your_plot, 2)` to see where things end up, `2` being the number of the layer you wish to see. – teunbrand Aug 12 '19 at 05:56
  • Thank you. I will give this a try. I played around with `arrange()` from the `dplyr` package, but that did not solve the problem. I will try your suggestion. My real data are confidential, so I cannot post my real code, but if this does not work, I will try to reproduce the error in the dataset above and edit my question. Thanks again! – Sylvia Rodriguez Aug 12 '19 at 06:06
  • Thank you. I figured out that the problem is caused by some half-violins having a sample size of 0 (no samples). From that point on, the labels are incorrect. I'll now try to think about a solution. Thanks again. – Sylvia Rodriguez Aug 12 '19 at 06:14
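
One possible way around the switched labels described in the comments, as a rough sketch rather than a tested fix for the real data: count the samples per x/m combination up front and give each label an explicit x position, so the placement no longer depends on the order in which groups are processed. The names `counts` and `x_pos` are made up for this illustration, and the left/right assignment (first level of `m` on the left) is an assumption that matches the example data.

```r
# Data from the question
set.seed(20160229)
my_data <- data.frame(
  y = c(rnorm(500), rnorm(300, 0.5), rnorm(400, 1), rnorm(200, 1.5)),
  x = c(rep("a", 800), rep("b", 600)),
  m = c(rep("i", 300), rep("j", 700), rep("i", 400))
)

# Count every x/m combination. table() reports a combination without any
# rows as 0 instead of dropping it, so later labels cannot shift.
counts <- as.data.frame(table(x = my_data$x, m = my_data$m), responseName = "n")

# Assumption: the first level of m belongs to the left half, the second to
# the right half. x and m are factors in counts, so as.integer() returns the
# axis position of x and the level index of m.
counts$x_pos <- as.integer(counts$x) + ifelse(as.integer(counts$m) == 1, -0.25, 0.25)
counts$y <- min(my_data$y) * 1.06
counts$label <- paste("n =", counts$n)

# Plot only if geom_split_violin() from the question has been defined.
if (exists("geom_split_violin")) {
  library(ggplot2)
  p <- ggplot(my_data, aes(x, y, fill = m)) +
    geom_split_violin() +
    geom_text(data = counts, aes(x = x_pos, y = y, label = label),
              inherit.aes = FALSE)
  print(p)
}
```

Note that `geom_split_violin()` as written in the question picks the side of each half from the parity of the group number (`grp %% 2`), so an empty combination may move the violin halves as well as the labels. It is worth checking with `layer_data(p, 1)` that each half sits on the side you expect.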