R: Create waterfall chart by comparing multiple values to one

Question

Main question: can you find a cleaner way to compare multiple observations to one single value in a long-oriented table?

The objective is to create a waterfall chart that displays votes for a party ("A") and compare it to competitors.

The input is a list (tibble) containing three columns: neighbourhood, party, and pct_votes. Example:

prop.long
# A tibble: 304 x 3
   neighbourhood    party       pct_votes
   <fct>            <chr>        <dbl>
 1 Region-A         A           0.0938
 2 Region-A         B           0.0672
 3 Region-A         C           0.0906
 4 Region-A         D           0.228
 5 Region-A         E           0.0318 
 6 Region-B         A           0.0932
 7 Region-B         B           0.118
 8 Region-B         C           0.0837
 9 Region-B         D           0.199
10 Region-B         E           0.0544

To display whether party A was doing better or worse than competitors the direction attribute is required. The y-variables will be used to display the bars. If a party did worse than A, y_min should be set to the party's value, otherwise, party A's pct_votes value should be taken as y_min. y_max should be set party A's pct_votes value if it performed worse, and otherwise take their own pct_votes value. The x-variables are used to display the political parties side-by-side in a particular order in the plot.

This is the desired output:

prop.wf
# A tibble: 76 x 10
# Groups:   neighbourhood [19]
   neighbourhood    party   pct_votes   tmp     direction   y_min y_max x_min x_max
   <fct>            <fct>   <dbl>       <dbl>   <chr>       <int> <int> <int> <int>
 1 Region-A         A       0.0938      0.0938  target      0     9     0     1
 2 Region-A         B       0.0672      0.0938  lower       6     9     1     2
 3 Region-A         C       0.0906      0.0938  lower       9     9     3     4
 4 Region-A         D       0.228       0.0938  higher      9    22     4     5
 5 Region-B         A       0.0932      0.0932  target      0     9     0     1
 6 Region-B         B       0.118       0.0932  higher      9    11     1     2
 7 Region-B         C       0.0837      0.0932  lower       8     9     3     4
 8 Region-B         D       0.199       0.0932  higher      9    19     4     5
# … with 68 more rows

The code that produced the output as desired:

prop.wf <- prop.long %>%
  filter(party %in% c('A', 'B', 'C', 'D')) %>%
  group_by(neighbourhood) %>%
  mutate(tmp = pct_votes[party == 'A']) %>%
  mutate(party = factor(party, levels = c('A', 'B', 'C', 'D')),
    direction = ifelse(party == 'A', 'target', ifelse(pct_votes > tmp, 'higher', ifelse(pct_votes < tmp, 'lower', 'equal'))),
    y_min = as.integer((ifelse(party == 'A', 0, ifelse(direction == 'lower', pct_votes, tmp)) * 100)),
    y_max = as.integer((ifelse(party == 'A', pct_votes, ifelse(direction == 'lower', tmp, pct_votes)) * 100)),
    x_min = as.integer(ifelse(party == 'A', 0, ifelse(party == 'B', 1, ifelse(party == 'C', 2, ifelse(party == 'D', 3, 4))))),
    x_max = as.integer(ifelse(party == 'A', 1, ifelse(party == 'B', 2, ifelse(party == 'C', 3, ifelse(party == 'D', 4, 5)))))) # `x_min + 1` did not yield int, even after casting with `as.integer()

My main question is: can you help me refactor this into more clear/scalable code? (E.g. what if two parties need to be added? Preferable these ifelse() statements are not chained.) I could not stop thinking: "there should be a way easier way to formulate this", but I could not come up with it.

Example of my final output (using ggplot's geom_rect):

Leveraged resources:

score 3 · Accepted Answer · answered Aug 09 '20 at 19:14

You can save a lot of code by doing a geom_col instead of a geom_rect . This involves spoofing the y axis, but it means your whole code including plotting looks like this:

library(dplyr)
library(ggplot2)

df %>% 
  group_by(neighbourhood) %>%
  mutate(pct = ifelse(party == "A", -pct_votes, 
                      pct_votes - pct_votes[party == "A"]),
         fill = ifelse(party == "A", "#fd9826", 
                       c("#3eca3f", "", "#ca1f15")[sign(pct) + 2])) %>%
  ggplot(aes(party, pct, fill = fill)) +
  geom_col(color = "gray50", width = 1) +
  facet_grid(~neighbourhood) +
  scale_y_continuous(limits = c(min(-df$pct_votes[df$party == "A"]), 0.15),
                     breaks = seq(min(-df$pct_votes[df$party == "A"]), 0.2,  0.05),
                     labels = scales::percent(0:5 / 20)) +
  scale_fill_identity()

^{Created on 2020-08-09 by the reprex package (v0.3.0)}

Thank you Allan, this is exactly what I was looking for. Can you explain how the `-` signs work in the `ifelse()` and the `seq(min())` statements? — Bart Schuijt, Aug 09 '20 at 20:58
Sure @BartSchuijt - calling `signs` on a vector replaces any negative number with -1, and any positive number with + 1, so adding two to the result gives a vector of 1s and 3s for "negative" and "positive". If we use this vector to subset a length-3 vector of color names, it will therefore transform the negative values into the first colour and positive values into the third colour. We can leave the middle color blank because if the value is exactly zero, it won't be seen and doesn't need a colour. — Allan Cameron, Aug 09 '20 at 21:04

score 2 · Answer 2 · answered Aug 09 '20 at 19:56

Here is another way, but not a bar graph like in the question. The code below produces a waterfall graph like the ones in the question's first link or the Wikipedia. The colors are adapted from user Allan Cameron's answer. The data transformation includes code to compute the line segments' end points.

library(tidyverse)

bar.width <- 0.8

prop.long %>%
  group_by(neighbourhood) %>%
  mutate(y_min = dplyr::lag(pct_votes),
         y_max = pct_votes,
         x_min = as.integer(factor(party)) - bar.width/2,
         x_max = as.integer(factor(party)) + bar.width/2,
         xend = lead(x_max),
         yend = lead(y_min)) %>%
  replace_na(list(y_min = 0)) %>%
  mutate(fill = ifelse(party == "A", -y_max, y_max - y_min),
         fill = ifelse(party == "A", "#fd9826",
                       c("#3eca3f", "", "#ca1f15")[sign(fill) + 2])) %>%
  ggplot(aes(xmin = x_min, xmax = x_max, ymin = y_min, ymax = y_max)) +
  geom_rect(aes(fill = fill)) +
  geom_segment(aes(x = x_min, xend = xend, y = yend, yend = yend), size = 0.2) +
  scale_fill_identity() +
  facet_wrap(~ neighbourhood)

Very nice Rui. This is a genuine waterfall plot and the code is elegant. +1 — Allan Cameron, Aug 09 '20 at 20:35
Thank you Rui. I was probably wrong to mention a waterfall chart. A better way to describe it is unknown to me. — Bart Schuijt, Aug 09 '20 at 21:00
@BartSchuijt OK, thanks for the information. Anyway, maybe future users will need something like the code in the answer, so, there it is. — Rui Barradas, Aug 09 '20 at 21:02

R: Create waterfall chart by comparing multiple values to one

2 Answers2