
I am close to plotting what I wanted, but haven't quite figured out whether stat_summary is the right way to display the desired plot.

The desired output is the scatter plot with a median line for each year, within each category. For example, in the plot below, I would want a median line for the values in 1999, 2000, and 2001 in Category A (i.e., 3 lines by color) and then the same in Category B (so 6 median lines total).

I looked here, but this didn't seem to get at what I wanted since it was using facets.

My plot looks like it is drawing a line between the medians of each category. Can stat_summary just draw a median line within each category, or do I need a different approach (like calculating the medians and adding each line to the plot by category)?


Reproducible simple example

library(tidyverse)
library(lubridate)

# Sample data
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2002-01-01"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)

# Create data frame
mydata   <- data.frame(Date, Category, Value)

# Plot by category and color by year
p <- ggplot(mydata, aes(x = Category, y = Value,
                        color = factor(year(Date))
                        )
            ) + 
  geom_jitter() 
p


# Now add median values of each year for each group
p <- p +
  stat_summary(fun.y = median,
               geom  = "line",
               aes(color = factor(year(Date))),
               group = 1,
               size = 2
               )
p
markus
DaveM
  • I'm confused: you have `Category` on the x-axis. Wouldn't you expect the lines to connect from one `Category` to the next? If you want lines within each group, what would they be connecting? Or do you actually just want a point at each of those 6 medians, or a horizontal line denoting the medians? – camille Jul 01 '18 at 22:02
  • The lines would be showing the median in each category by year so one can see where they are in comparison within category and compared to the other category, but actually connecting the lines across categories with the real data set doesn’t make sense in this case. – DaveM Jul 01 '18 at 22:07
  • So more like the post you linked to than a traditional line chart. `geom_line`'s default purpose is more to connect observations, but you want something that's like a point but...a line shape? – camille Jul 01 '18 at 22:12

2 Answers


What you're looking for is actually a point, even though it looks like a line, because you don't want to connect observations (what a line does), you just want to show a discrete value (what a point does).

One way, very similar to the post you linked, is to do your stat_summary and use a shape that is essentially a large dash. I turned down the alpha and size of the jittered points to distinguish them from the medians better. For the medians, I kept the color assignment the same but set the group to the interaction between year and category, so there would be six distinct medians calculated.

Note that I set a seed for random number generation and changed the end date to 12/31/2001 instead of 1/1/2002, since you said you expected 3 years but during one generation I got a few observations of 1/1/2002.

library(tidyverse)
library(lubridate)

set.seed(987)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2001-12-31"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)

# Create data frame
mydata   <- data.frame(Date, Category, Value)

mydata <- mydata %>%
  mutate(year = year(Date) %>% as.factor())

ggplot(mydata, aes(x = Category, y = Value, color = year)) +
  geom_jitter(size = 0.6, alpha = 0.6) +
  stat_summary(fun = median,  # ggplot2 >= 3.3.0; use fun.y on older versions
               geom = "point",
               aes(group = interaction(Category, year)),
               shape = 95, size = 12, show.legend = FALSE)

Created on 2018-07-01 by the reprex package (v0.2.0).
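
As a quick cross-check of the grouping, the six medians that interaction(Category, year) should produce can also be computed directly. This is only a minimal base-R sketch, assuming the same simulated data as above; the names `Year` and `medians` are illustrative and not part of the original answer.

```r
# Recreate the simulated data from the answer (same seed and date range)
set.seed(987)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2001-12-31"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)
mydata   <- data.frame(Date, Category, Value)

# Year as a factor, extracted with base R instead of lubridate::year()
mydata$Year <- factor(format(mydata$Date, "%Y"))

# One median per Category x Year combination (2 categories x 3 years)
medians <- aggregate(Value ~ Category + Year, data = mydata, FUN = median)
medians
```

Each row of `medians` should correspond to one of the dashes drawn by stat_summary in the plot.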

camille
  • Very helpful Camille. Not exactly what I wanted but close and very clever. – DaveM Jul 01 '18 at 23:03
  • It doesn't seem to work for geom="boxplot" for some reason. – Simon Woodward Sep 27 '19 at 00:08
  • @SimonWoodward a boxplot has more complicated things that need to be specified than just a point. You should be getting the error message `geom_boxplot requires the following missing aesthetics: lower, upper, middle` – camille Sep 27 '19 at 02:46
  • I solved my problem here https://stackoverflow.com/questions/58108631/how-to-fix-boxplot-code-that-no-longer-works-after-changes-to-ggplot2-3-2-0/58127150#58127150 – Simon Woodward Sep 27 '19 at 02:54

Here is another possibility, using geom_errorbar instead of stat_summary:

library(tidyverse)
library(lubridate)

# Sample data
set.seed(2017)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2002-01-01"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)
mydata   <- data.frame(Date, Category, Value)

mydata %>%
    mutate(colour = factor(year(Date))) %>%
    group_by(Category, year(Date)) %>%
    mutate(Median = median(Value)) %>%
    ggplot(aes(Category, Value, colour = colour)) +
    geom_jitter() +
    geom_errorbar(
        aes(ymin = Median, ymax = Median))


Explanation: We pre-compute median values per Category per year(Date) and draw median lines using geom_errorbar.


Update

In response to your comment: if you want to use summarise to pre-compute the medians, you can store them in a separate data.frame:

df <- mydata %>%
    mutate(Year = as.factor(year(Date))) %>%
    group_by(Category, Year) %>%
    summarise(Median = median(Value))

ggplot(mydata, aes(Category, Value, colour = factor(year(Date)))) +
    geom_jitter() +
    geom_errorbar(
        data = df,
        aes(x = Category, y = Median, colour = Year, ymin = Median, ymax = Median))

It's not quite as clean as the first solution (since you need to specify all aesthetics in geom_errorbar) but the resulting plot is the same.
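
For completeness, the two approaches in this thread can be combined: stat_summary can compute the medians itself and hand them to the errorbar geom, so nothing has to be pre-computed. The following is only a minimal sketch, assuming ggplot2 3.3.0 or later (where `fun`, `fun.min`, and `fun.max` replaced `fun.y`, `fun.ymin`, and `fun.ymax`). The `Year` column and the end date of 2001-12-31 are choices made here for illustration.

```r
library(ggplot2)

# Simulated data, restricted to three calendar years
set.seed(2017)
Date     <- sort(sample(seq(as.Date("1999-01-01"), as.Date("2001-12-31"), by = "day"), 500))
Category <- rep(c("A", "B"), 250)
Value    <- sample(100:500, 500, replace = TRUE)
mydata   <- data.frame(Date, Category, Value)
mydata$Year <- factor(format(mydata$Date, "%Y"))

p <- ggplot(mydata, aes(Category, Value, colour = Year)) +
  geom_jitter(alpha = 0.6) +
  # ymin == ymax == median collapses each error bar to one horizontal line
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "errorbar")
p
```

Because both x and colour are discrete here, ggplot2 groups by their combination by default, so no explicit group aesthetic should be needed.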

Maurits Evers
  • Brilliant, thank you. I was trying to do something similar by layering geom_hline after computing the medians by year in a separate data frame (and was having some issues layering), but this works perfectly. – DaveM Jul 01 '18 at 23:06
  • Maurits, if I could ask a follow up: if I computed medians separately as mentioned with mymedians <- mydata %>% group_by(Category, factor(year(Date))) %>% summarize(median_by_year = median(Value)) is there a way to layer the median lines from this data frame onto the original plot? Your answer works great, I was just wondering if there was another way to layer the same info. – DaveM Jul 01 '18 at 23:09
  • @DaveM I've updated my post to give an example in response to your comment. Please take a look. – Maurits Evers Jul 01 '18 at 23:17