
Say I have a data set with x and y values that are grouped according to two variables: grp is a, b, or c, while subgrp is D, E, or F.

  • a has y values in [0, 1]
  • b has y values in [10, 11]
  • c has y values in [100, 101].

I'd like to plot y against x with the colour of the point defined by y for all grp and subgrp combinations. Since each grp has very different y values, I can't just use facet_grid alone, as the colour scales would be useless. So, I plot each grp with its own scale then patch them together with plot_grid from cowplot. I also want to use a three-point gradient specified by scale_colour_gradient2. My code looks like this:

# Set RNG seed
set.seed(42)

# Toy data frame
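# stringsAsFactors = TRUE keeps grp a factor on R >= 4.0, so levels(df$grp) works
df <- data.frame(x = runif(270), y = runif(270) + rep(c(0, 10, 100), each = 90),
                 grp = rep(letters[1:3], each = 90), subgrp = rep(LETTERS[4:6], 90),
                 stringsAsFactors = TRUE)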

head(df)
#>           x         y grp subgrp
#> 1 0.9148060 0.1362958   a      D
#> 2 0.9370754 0.7853494   a      E
#> 3 0.2861395 0.4533034   a      F
#> 4 0.8304476 0.1357424   a      D
#> 5 0.6417455 0.8852210   a      E
#> 6 0.5190959 0.3367135   a      F

# Load libraries
library(cowplot)
library(ggplot2)
library(dplyr)

# Plotting list
g_list <- list()

# Loop through groups 'grp'
for(i in levels(df$grp)){
  # Subset the data
  df_subset <- df %>% filter(grp == i)
  
  # Calculate the midpoint
  mp <- mean(df_subset$y)
  
  # Print midpoint
  message("Midpoint: ", mp)
  
  g <- ggplot(df_subset) + geom_point(aes(x = x, y = y, colour = y))
  g <- g + facet_grid(. ~ subgrp) + ggtitle(i)
  g <- g + scale_colour_gradient2(low = "blue", high = "red", mid = "yellow", midpoint = mp)
  g_list[[i]] <- g
}
#> Midpoint: 0.460748857570191
#> Midpoint: 10.4696476330981
#> Midpoint: 100.471083269571

plot_grid(plotlist = g_list, ncol = 1)

Created on 2019-04-17 by the reprex package (v0.2.1)

In this code, I specify the midpoint of the colour gradient as the mean of y for each grp. I print this and verify that it is correct. It is.

My question: why are my colour scales incorrect for the first two plots?

It appears the same range is applied to each grp despite subsetting the data. If I replace for(i in levels(df$grp)){ with for(i in levels(df$grp)[1]){, the colour scale is correct for the single plot that is produced.


Update

Okay, this is weird. Inserting ggplot_build(g)$data[[1]]$colour immediately before g_list[[i]] <- g solves the problem. But, why?


Dan
  • This commonly comes up with looping and ggplot2. I'm not sure about your exact case, but it is likely that this has something to do with when variables are evaluated in the plot. See the explanation [here](https://stackoverflow.com/a/39057372/2461552) and info [here](https://stackoverflow.com/questions/26235825/for-loop-only-adds-the-final-ggplot-layer) – aosmith Apr 17 '19 at 14:27
  • @aosmith That's really interesting. So, presumably `ggplot_build(g)$data[[1]]$colour` forces evaluation and thus retains the colours as they should be? It seems an alternative is to `print` the plots invisibly: `invisible(print(g))` just before `g_list[[i]] <- g`. – Dan Apr 17 '19 at 14:31
  • That's my guess. One of the things I like about the approach of splitting the dataset into a list by groups and then looping through the datasets to make many **ggplot2** plots with `lapply()`/`purrr::map()` is that it avoids some of this. – aosmith Apr 17 '19 at 14:36
  • @aosmith Good stuff. I'll try that. Thanks for your help. – Dan Apr 17 '19 at 14:38
  • I had a similar case with looping and ggplot; the answer was that ggplot has problems with local variables. I think it's much the same here. I'm still not exactly sure why ggplot behaves like this. [my old question](https://stackoverflow.com/questions/54808795/how-to-add-multiple-curves-functions-to-one-ggplot-through-looping) – mischva11 Apr 17 '19 at 14:55

1 Answer


Long story short, you're creating unevaluated promises and then evaluating them at a time when the original data is gone. This problem is generally avoided if you use a functional programming style rather than procedural code, i.e., define a function that does the work and then use an apply function for the loop. Each function call then gets its own environment, so mp still holds that group's value when the plot is finally built.

set.seed(42)

# Toy data frame
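# stringsAsFactors = TRUE keeps grp a factor on R >= 4.0, so levels(df$grp) works
df <- data.frame(x = runif(270), y = runif(270) + rep(c(0, 10, 100), each = 90),
                 grp = rep(letters[1:3], each = 90), subgrp = rep(LETTERS[4:6], 90),
                 stringsAsFactors = TRUE)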

library(cowplot)
library(ggplot2)
library(dplyr)

# Loop through groups 'grp'
g_list <- lapply(
  levels(df$grp), 
  function(i) {
    # Subset the data
    df_subset <- df %>% filter(grp == i)

    # Calculate the midpoint
    mp <- mean(df_subset$y)

    # Print midpoint
    message("Midpoint: ", mp)

    g <- ggplot(df_subset) + geom_point(aes(x = x, y = y, colour = y))
    g <- g + facet_grid(. ~ subgrp) + ggtitle(i)
    g <- g + scale_colour_gradient2(low = "blue", high = "red", mid = "yellow", midpoint = mp)
    g
  }
)
#> Midpoint: 0.460748857570191
#> Midpoint: 10.4696476330981
#> Midpoint: 100.471083269571

plot_grid(plotlist = g_list, ncol = 1)

Created on 2019-04-17 by the reprex package (v0.2.1)
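
To see the promise issue without ggplot2, here is a minimal base R sketch. It assumes the scale holds on to its midpoint argument lazily; make_rescaler() and make_rescaler_forced() are hypothetical stand-ins for a scale constructor, not ggplot2 functions.

```r
# Hypothetical stand-in for a scale constructor: it returns a closure and
# never touches `midpoint` itself, so the argument stays an unevaluated promise.
make_rescaler <- function(midpoint) {
  function(x) x - midpoint
}

# for loop: every promise points at the same `mp` in the calling environment,
# so all closures see the value `mp` has once the loop has finished.
loop_fns <- list()
for (k in 1:3) {
  mp <- k * 10
  loop_fns[[k]] <- make_rescaler(mp)
}
loop_res <- sapply(loop_fns, function(f) f(0))
loop_res
#> [1] -30 -30 -30

# lapply: each call of the anonymous function has its own `mp`,
# so each promise resolves to that call's value.
apply_fns <- lapply(1:3, function(k) {
  mp <- k * 10
  make_rescaler(mp)
})
apply_res <- sapply(apply_fns, function(f) f(0))
apply_res
#> [1] -10 -20 -30

# force() evaluates the promise immediately, which also fixes the for loop.
make_rescaler_forced <- function(midpoint) {
  force(midpoint)
  function(x) x - midpoint
}
forced_fns <- list()
for (k in 1:3) {
  mp <- k * 10
  forced_fns[[k]] <- make_rescaler_forced(mp)
}
forced_res <- sapply(forced_fns, function(f) f(0))
forced_res
#> [1] -10 -20 -30
```

If the scale behaves like make_rescaler(), this would also be consistent with the workaround in the question: calling ggplot_build(g) or print(g) inside the loop evaluates the plot while mp still has that iteration's value.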

Claus Wilke
  • Consider `by`, more streamlined than the nested `lapply` + `split` or `lapply` + `levels` or `lapply` + `unique`. – Parfait Apr 17 '19 at 20:44
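
As a rough illustration of the by() suggestion, the sketch below splits the toy data on grp and calls a function once per subset. To stay free of plotting dependencies it only returns each group's midpoint; the function body could build and return a plot for that subset instead, though whether the resulting object can be passed straight to plot_grid is an assumption that has not been checked here.

```r
set.seed(42)

# Same toy data frame as in the question
df <- data.frame(x = runif(270), y = runif(270) + rep(c(0, 10, 100), each = 90),
                 grp = rep(letters[1:3], each = 90), subgrp = rep(LETTERS[4:6], 90))

# by() splits df on grp and applies the function to each subset.
# Each call has its own environment, so values computed inside it
# (such as the midpoint) stay tied to that group.
mp_by_grp <- by(df, df$grp, function(df_subset) {
  mean(df_subset$y)
})

mp_by_grp
```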