12

I need to write a function that would allow me to quickly do a dual axis plot using ggplot2. I know that dual axis plots are generally deprecated, but still I think it may be useful if you're after observing similar patterns in time series (for all of those who disagree, please treat this question strictly technically). It is actually possible with sec_axis() function from ggplot2, but it needs a defined formula. So here's my attempt to calculate this automatically:

dual_plot <- function(data, x, y_left, y_right){
  x <- ensym(x)
  y_left <- ensym(y_left)
  y_right <- ensym(y_right)

  ratio_model <- lm(eval(y_left) ~ eval(y_right), data = data)

  data %>% 
    select(!!x, !!y_left, !!y_right) %>% 
    mutate(!!y_right := predict(ratio_model)) %>% 
    gather(k, v, -!!x) %>% 
    ggplot() + 
    geom_line(aes(!!x, v, colour = k)) +
    scale_y_continuous(sec.axis = sec_axis(~ . / ratio_model$coefficients[[2]] -
                                             ratio_model$coefficients[[1]],
                                           name = rlang::as_string(y_right))) + 
    labs(y = rlang::as_string(y_left))
}

However, lm may fit a negative direction coefficient which reverse the trend and is really misleading. So I need another approach to calculating this formula - either using linear regression with coefficient constrain or a clever way of fitting a formula. How can it be done in R? Or what are the alternatives to sec_axis that would allow to draw dual axis plot automatically?

@Edit: One example would be:

df <- structure(list(date = structure(c(17167, 17168, 17169, 17170, 
17171, 17172, 17173, 17174, 17175, 17176, 17177, 17178, 17179, 
17180, 17181), class = "Date"), y_right = c(-107073.90734625, 
-633197.630546488, -474626.43291613, -306006.801458608, 56062.072352192, 
522580.236751187, 942796.389093215, -101845.73678439, -632658.677118481, 
-479257.088784885, -303439.231633988, 50273.2477880417, 521669.062954895, 
948127.92455586, -107073.90734625), y_left = c(1648808.16, 3152543.07, 
2702739.91, 2382616.25, 1606089.88, 1592465.75, 1537283.99, 2507221.61, 
3049076.19, 3125424.4, 2774215.1, 2356412.98, 1856506.41, 1477195.08, 
2485713.2)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-15L))

df %>% 
  dual_plot(date, y_left, y_right)

enter image description here

The calculated ratio model has direction coefficient of -1.02, so the y_right is reversed (where the function is decreasing, the plotted function is increasing and the other way around) and hence misleading.

jakes
  • 1,964
  • 3
  • 18
  • 50

2 Answers2

7

Here's an approach that sets a minimum acceptable ratio between the two slopes; if the ratio is less, the slope is not transformed but rather just the level, thus preventing excessively misleading charts like the one you described.

I have the threshold set at 0.1, but it could be 0 if you just want to avoid the specific case here, where you don't want the second series flipped to make it align.

dual_plot <- function(data, x, y_left, y_right){
  x <- ensym(x)
  y_left <- ensym(y_left)
  y_right <- ensym(y_right)

  min_slope_ratio <- 0.1
  ratio_model <- lm(eval(y_left) ~ eval(y_right), data = data)
  ratio_slope <- ratio_model$coefficients[[2]]

  if (ratio_model$coefficients[[2]] < min_slope_ratio) {
    ratio_model <- lm(eval(y_left) ~ 1, data = data)
    ratio_slope <- min_slope_ratio
  }
  ratio_intercept <- ratio_model$coefficients[[1]]


  data %>%
    select(!!x, !!y_left, !!y_right) %>%
    mutate(!!y_right := !!y_right * ratio_slope + ratio_intercept) %>%
    # mutate(!!y_right := predict(ratio_model)) %>%
    gather(k, v, -!!x) %>%
    ggplot() +
    geom_line(aes(!!x, v, colour = k)) +
    scale_y_continuous(sec.axis = sec_axis(~ . / ratio_slope -
                                             ratio_intercept,
                                           name = rlang::as_string(y_right))) +
    labs(y = rlang::as_string(y_left))
}

Here, the limit is triggered and we avoid flipping the 2nd series

df %>% 
  dual_plot(date, y_left, y_right)

enter image description here

Here, the limit is not triggered.

df %>%
  mutate(y_right = -1 * y_right) %>%
  dual_plot(date, y_left, y_right)

enter image description here

Jon Spring
  • 55,165
  • 4
  • 35
  • 53
3

The ggplot capability to allow secondary axes (from version 2.2 on) is mostly a labelling benefit. You still have to project your secondary data onto the proper range. I think the easiest and safest way to accomplish that is a min-max transformation, using the ranges to:

  1. Project the second series onto the range of the first to plot the points.
  2. Project back the other way for the labels.

Note there are many ways for this to be misleading in its own way, including the fact that it will use the full range for the secondary variable even if it definitely shouldn't. Take care.

Simple code without function

df %>%
  select(date, y_left, y_right) %>%
  mutate(y_right = scales::rescale(y_right, to=range(df$y_left))) %>%
  gather(key, value, -date) %>%
  ggplot() +
  geom_line(aes(x = date, y = value, color = key)) +
  scale_y_continuous(sec.axis = sec_axis(~ scales::rescale(., to=range(df$y_right)),
                              name = "Right side")) +
  labs(y = "Left side",
       color = "Series")

Dynamic code using tidyeval function

I've tried to conserve your code, and focus on the use of scales::rescale to project from one range to another.

library(scales)
library(tidyverse)

dual_plot <- function(data, x, y_left, y_right) {
  x <- ensym(x)
  y_left <- ensym(y_left)
  y_right <- ensym(y_right)

  # Introducing ranges
  left_range <- range(data %>% pull(!!y_left))
  right_range <- range(data %>% pull(!!y_right))

  data %>%
    select(!!x, !!y_left, !!y_right) %>%
    # Transform
    mutate(!!y_right := scales::rescale(!!y_right, to=left_range)) %>%
    gather(k, v, -!!x) %>%
    ggplot() +
    geom_line(aes(!!x, v, colour = k)) +
    # Change secondary axis scaling and label
    scale_y_continuous(sec.axis = sec_axis(~ scales::rescale(., to=right_range),
                                           name = rlang::as_string(y_right))) +
    labs(y = rlang::as_string(y_left),
         color = "Series")
}

Example output

I think this output, while different than other answers, preserves the nature of the data and ranges for both primary and secondary variables and their axes.

df %>%
  dual_plot(date, y_left, y_right)

enter image description here

More detail on SO here.

Comments welcomed.

ravic_
  • 1,731
  • 9
  • 13
  • I like that. But is there any way to extend this approach to use it with facets? Like `y_left` is stable, but `y_right` is changing across facets. We could use `mutate_at(vars(!!!y_right), ~scales::rescale(.x, to = left_range))` to scale the `y_right` variables, but the opposite scaling within `sec_axis` is not so clear. – jakes Nov 19 '19 at 12:47
  • I don't this plays well with facets, at least not yet. To start, I think there's a bug where the secondary axis isn't updated when `scales = "free_y"`. (Will research, and then file). Then there are control problems like you describe where you might want to free just the secondary. In this approach, I think I'd create a plots for each group and use `ggarrange` or similar to put them on a single plot. Not ideal, to be sure. Will update my answer with this caution. – ravic_ Nov 19 '19 at 15:47