get count with stat_frequency

Question

I have a routine to create some plots with ggplot:

getPlotList = function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){
    ggplot(data=data, aes(x=data[[var]], y=data[[y]], color=data[[color]]))+
      stat_summary(fun.y = mean, fun.ymin = function(x){mean(x) - sem(x)}, fun.ymax = function(x){mean(x) + sem(x)}, geom = "errorbar", width=.1, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA)
  }, simplify = FALSE, USE.NAMES = TRUE)
}

I use it like this:

c("col1", "col2", "col3") %>% getPlotList(y="col4", color="col5")

This works perfectly (I have dozens of plots to produce) and gives a result like this (but without the n=... labels):

The thing is, my count is the same for every color, but it can change with x.
Since there are error bars (which won't show if n=1 or n=0), I have to show the count in labels, like I did in the picture (with Paint).

There are a lot of similar questions on SO (like this one, this one, this one, etc.), but they all use geom_histogram or geom_bar, which have the ..count.. metavariable available, unlike the stat_summary I'm using.

How can I add those labels?

PS: I tried to use quosures instead of data[[...]] in my function but failed miserably. This is not the main part of the question, but if anybody has an idea it would help me a lot.

It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample data that can be used for testing. — MrFlick, Jan 10 '18 at 16:12

Mark Peterson · Accepted Answer · 2018-07-23T18:00:44.213

This is built using these sample data:

sampleData <-
  data.frame(
    col1 = factor(rep(LETTERS[1:4], c(12, 6, 16, 20)*5)
                  , levels = LETTERS[1:4])
    , col2 = factor(rep(LETTERS[1:4], c(1, 17, 16, 20)*5)
                    , levels = LETTERS[1:4])
    , col3 = factor(rep(LETTERS[1:4], c(0, 18, 16, 20)*5)
                    , levels = LETTERS[1:4])
    , col4 = rnorm(54*5, 4, 2)
    , col5 = factor(rep(1:5, 54))
  )

The basic approach is simply to add the labels yourself. For that, I used table to count the occurrences of each X/color combination and built a new data.frame to display those counts. Note that, while you say that each color within an X grouping always has the same sample size, it is better to program defensively. Instead of trusting that (and, e.g., using the counts for the first color), I use apply to collect all of the unique counts in each X group. As long as there is only one, the effect is the same. However, if there is more than one, the label will list every distinct count (separated by ";"), which flags the imbalance.
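To see what the table/apply step produces on its own, here is a small base-R sketch. The toy data frame and its column names (grp, col) are made up for illustration and are not from the question:

```r
# Toy data (hypothetical): group A is balanced across colours, group B is not
toy <- data.frame(
  grp = factor(c("A", "A", "A", "A", "B", "B", "B"), levels = c("A", "B")),
  col = factor(c(1, 2, 1, 2, 1, 1, 2))
)

# Rows are the x groups, columns are the colours
counts <- table(toy$grp, toy$col)

# Collapse each row to its distinct counts, joined by ";"
labels <- paste(
  "n =",
  apply(counts, 1, function(x) paste(unique(x), collapse = ";"))
)
names(labels) <- rownames(counts)

print(labels)
```

Group A has the same count for both colours, so its label holds a single number; group B does not, so its label holds both counts and the mismatch is visible on the plot.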

In addition, I switched the mapping to aes_string so that the axis and legend titles pick up your column names. If you don't like that behavior, just override them with ylab etc.

Similarly, the function sem was not found (I assume it is a custom function), so I used the mean_cl_normal function instead (it requires the Hmisc package to be installed), which has the added advantage of using the fun.data argument for cleaner code. (I also prefer confidence intervals to just showing SEM, but that is more style than substance.)
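If you would rather keep your original error bars, sem needs a definition. The question does not show one, so the following is only a guess at what it does, assuming it is the usual standard error of the mean:

```r
# Assumed definition (not from the question): standard error of the mean,
# ignoring missing values
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

sem(c(1, 2, 3, 4, 5, NA))  # standard error of the five non-missing values
sem(3)                     # NA: sd() is undefined for a single observation
```

The second call illustrates why the error bars disappear when n = 1: the standard deviation of a single value is NA, so there is nothing to draw.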

getPlotList = function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){

    myCounts <- table(data[[var]], data[[color]])

    forLabels <-
      data.frame(
        x = row.names(myCounts)
        , label = paste("n =", apply(myCounts, 1, function(x){paste(unique(x), collapse = ";")}))
        , y = 0.5
      )

    ggplot(data=data, aes_string(x=var, y=y, color=color))+
      stat_summary(fun.data = mean_cl_normal, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA) +
      geom_text(aes(x = x, y = y, label = label, color = NA)
                , forLabels
                , show.legend = FALSE)
  }, simplify = FALSE, USE.NAMES = TRUE)
}

Now, this code:

c("col1", "col2", "col3") %>% getPlotList(y="col4", color="col5", data = sampleData)

gives the following plots:

At the request of @Nettle, I modified the code to use a bit more of the tidyverse, specifically using Standard Evaluation to loop through the column list instead of using the base table approach from before. I believe that the code should function identically. The main advantage is removing the intermediate variables, though one could argue that those improve readability.

getPlotList <- function(param.list, data=db, y, color){
  param.list %>% sapply(function(var){

    ggplot(data=data, aes_string(x=var, y=y, color=color))+
      stat_summary(fun.data = mean_cl_normal, position = position_dodge(0.3), na.rm = TRUE) +
      stat_summary(fun.y = mean, geom = "point", position = position_dodge(0.3), na.rm = TRUE) +
      ylim(0, NA) +
      geom_text(aes_string(x = var, y = "y", label = "label", color = NA)
                , data %>%
                  count(!!as.name(var), !!as.name(color)) %>%
                  group_by(!!as.name(var)) %>%
                  summarise(
                    label = paste("n =", paste(unique(n), collapse = ";"))
                  ) %>%
                  mutate(y = 0.5)
                , show.legend = FALSE)

  }, simplify = FALSE, USE.NAMES = TRUE)
}

@Mark - any chance you could do this entirely in the tidyverse? — Nettle, Jul 21 '18 at 18:09
@Nettle, why? There is very little of it that is not in the tidyverse already. I suppose that you could change the creation of `forLabels` to a `summarise` statement (without the `myCounts` intermediate) if you really wanted. I tend to find that working with the standard evaluation steps (which you would need to loop through multiple columns) is a bit more finicky than I like, particularly when you are wrapping things in a function anyway. — Mark Peterson, Jul 21 '18 at 19:55
Alright @Nettle, I couldn't let go and this was a good excuse to play with the Standard Evaluation steps a bit more. I am still not sure why you wanted it (and I am genuinely curious), but there is now a version that keeps things more in the `tidyverse`. — Mark Peterson, Jul 23 '18 at 18:01
