adding geom_* in a for loop

Question

I want to compare real world data with simulated data within one graph. The code should accept any number of lines to plot. I came up with this:

simulationRuns <- 5 #Variable to be changed depending on how many simulations were made

plotLoop <- ggplot() + 
  geom_line(data = relWorldData, 
            mapping = aes(x = DateTime, y = VALUE, color = "realWorldData"))

for (i in 1:simulationRuns){
    plotLoop <- plotLoop +
      geom_line(data = listOfSimResults[[i]], 
                mapping = aes(x = DateTime, y = VALUE, color = paste0("simRun-", i)))
  }

figureLoop <- ggplotly(plotLoop)

The problem is that all lines are displayed as simRun-5 and are therefore not independent.

I am new to R so please have mercy ;) Thanks in advance, Patrick

Follow-up question (posted here because code is terrible to read in a comment):

I read up on lapply and rewrote the code to this:

plotLoop <- ggplot() +
  geom_line(data = relWorldData,
            mapping = aes(x = DateTime, y = VALUE, color = "RealWorldData"))

addGeomLine <- function (i, obj){
  obj <- obj +
    geom_line(data = listOfSimResults[[i]],
              mapping = aes(x = DateTime, y = VALUE, color = paste0("simRun-", i)))
}
lapply(1:runs, addGeomLine, plotLoop)

figureLoop <- ggplotly(plotLoop)

This time, only the RealWorldData is displayed, but none of the Simulations. Could you tell me what I am missing?

Limey · Accepted Answer · 2020-07-25T07:29:08.673

Welcome to SO!

You've run into a subtle problem that confuses a lot of people with far more experience than yourself. The problem is that ggplot2 evaluates lazily. Put simply, that means that it "makes a note" of what it needs to do when you tell it what you want, but doesn't actually do anything until the last possible moment.

Here, you tell ggplot that you want to add a geom in your for loop. ggplot makes a note of the geom's definition, but doesn't evaluate it. "At the last moment" is when you call ggplotly. Now ggplot realises it's got some work to do. For each geom, it notices that it needs to know the value of i. So it looks it up and finds the value 5. Hence your problem.
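To make the "makes a note" behaviour concrete, here is a minimal sketch that looks only at the aesthetic mapping, without building a plot. The variable name `mapping` is just for illustration, and `rlang::eval_tidy()` stands in for the evaluation that ggplot2 performs when the plot is finally built.

```r
library(ggplot2)

i <- 1
# aes() stores the expression paste0("simRun-", i) without evaluating it
mapping <- aes(color = paste0("simRun-", i))

# Change i after the mapping has been created, as the for loop does
i <- 5

# Evaluating the stored expression now picks up the current value of i
rlang::eval_tidy(mapping$colour)
# [1] "simRun-5"
```

The same thing happens to every layer added in the loop: they all refer to the one variable i, which holds 5 by the time anything is drawn.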

There are several ways to solve this. With your code, my preferred option is to replace the for loop with an lapply. Unlike a for loop, lapply calls your function once per element, and each call gets its own i, so the value is fixed at the time of execution.
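A minimal sketch of the lapply route, assuming listOfSimResults is a list of data frames with DateTime and VALUE columns as in the question. The toy data and the names makeLayer and simLayers are made up for illustration. It relies on the fact that a list of layers can be added to a ggplot with +.

```r
library(ggplot2)

# Toy stand-ins for the question's data (assumed columns: DateTime, VALUE)
set.seed(1)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
relWorldData <- data.frame(DateTime = dates, VALUE = rnorm(10))
listOfSimResults <- lapply(1:5, function(k) {
  data.frame(DateTime = dates, VALUE = rnorm(10))
})

# Hypothetical helper: builds one layer. Each call has its own i,
# so the label refers to the right simulation run when the plot is built.
makeLayer <- function(i) {
  geom_line(data = listOfSimResults[[i]],
            mapping = aes(x = DateTime, y = VALUE,
                          color = paste0("simRun-", i)))
}

simLayers <- lapply(seq_along(listOfSimResults), makeLayer)

plotLoop <- ggplot() +
  geom_line(data = relWorldData,
            mapping = aes(x = DateTime, y = VALUE, color = "realWorldData")) +
  simLayers  # adding a list adds each layer in turn
```

Because the function returns the layer rather than a modified plot, nothing has to be assigned from inside the function; plotLoop can then be passed to ggplotly as before.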

I believe you could also keep the for loop and wrap each reference to i in force(), though I've not personally tried that.
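If you want to keep the for loop, here is a sketch of one way that should work. As far as I can tell, force(i) written inside aes() is captured unevaluated along with the rest of the expression, so it may not help on its own. The sketch instead uses the `!!` injection operator, which aes() supports, to paste the current value of i into the stored expression. The toy data is made up to match the structure in the question.

```r
library(ggplot2)

# Toy stand-in for the question's simulation results (assumed columns)
set.seed(1)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
listOfSimResults <- lapply(1:5, function(k) {
  data.frame(DateTime = dates, VALUE = rnorm(10))
})

plotLoop <- ggplot()
for (i in seq_along(listOfSimResults)) {
  plotLoop <- plotLoop +
    geom_line(data = listOfSimResults[[i]],
              # !!i replaces i with its current value when aes() is called,
              # so the stored expression no longer refers to the loop variable
              mapping = aes(x = DateTime, y = VALUE,
                            color = paste0("simRun-", !!i)))
}
```

The data argument is not affected by the problem because listOfSimResults[[i]] is evaluated immediately; only the expressions inside aes() are deferred.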

The best approach in the long run, in my opinion, would be to make your workflow tidy and avoid the need for either the for loop or lapply altogether. This will also give you the benefits of more compact, robust and readable code that will almost certainly run faster. [I did some work the other day that converted a loop similar to yours to a tidy solution and the run time was reduced from nearly 40 seconds to under 2.]

Also, please read this post for advice on how to create a minimum working example. Providing MWEs will maximise your chances of getting a useful answer.

Update

To expand on my comment about the advantages of using a tidy data approach...

First, synthesize some data, as you haven't provided any. I'll try to match the structure of your data, but not your values. The only difference from your datasets is that I've added an ID variable to identify the simulation run or real-world dataset that each observation comes from.

library(lubridate)
library(tidyverse)

inVivoBG <- tibble(
              ID="Real-world data",
              DateTime2=seq(as_date("2006-03-01"), as_date("2015-03-01"), "3 months"),
              VALUE=100 + rnorm(37, mean=150, sd=20)
            ) 

listOfSimResults <- lapply(
                      1:5, 
                      function(x) {
                        tibble(
                          ID=paste0("simRun-", x),
                          DateTime2=seq(as_date("2006-03-01"), as_date("2015-03-01"), "3 months"),
                          VALUE=100 + rnorm(37, mean=150, sd=20)
                        )
                      }
                    )

Now combine the various data frames into a single one.

data <- bind_rows(inVivoBG, listOfSimResults)

At this point, the construction of your plot is a single line call.

data %>% 
  ggplot() + 
    geom_line(mapping = aes(x = DateTime2, y = VALUE, color = ID))

Giving a plot with six lines, one per ID, each in its own colour.

This approach avoids the need for a custom function or the need for lapply. It is also robust with respect to the number of lines required and their labels. Personally, I also think it's far easier to understand.
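If your real data frames do not already carry an ID column, one option is to let bind_rows create it from the names of a list via its .id argument. This is a sketch under that assumption; the toy data and the names allRuns and combined are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins without an ID column (assumed columns: DateTime, VALUE)
set.seed(1)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 10)
relWorldData <- data.frame(DateTime = dates, VALUE = rnorm(10))
listOfSimResults <- lapply(1:5, function(k) {
  data.frame(DateTime = dates, VALUE = rnorm(10))
})

# Name each list element; the names become the values of the ID column
names(listOfSimResults) <- paste0("simRun-", seq_along(listOfSimResults))
allRuns <- c(list(realWorldData = relWorldData), listOfSimResults)

combined <- bind_rows(allRuns, .id = "ID")

p <- ggplot(combined) +
  geom_line(mapping = aes(x = DateTime, y = VALUE, color = ID))
```

The number of simulation runs never appears in the plotting code, so adding or removing runs only changes the list.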

Quick follow-up question: I read up on lapply and rewrote the code: `plotLoop <- ggplot() + geom_line(data = inVivoBG, mapping = aes(x = DateTime2, y = VALUE, color = "RealWorldData")) addGeomLine <- function (i, obj){ obj <- obj + geom_line(data = listOfSimResults[[i]], mapping = aes(x = DateTime, y = subjE.Gp.conc, color = paste0("simRun-", i))) } lapply(1:runs, addGeomLine, plotLoop) figureLoop <- ggplotly(plotLoop)` This time only the RealWorldData line is plotted, none of the others. Could you tell me what I am missing? — Patrick Nit, Jul 22 '20 at 09:07
Yep. Your function `addGeomLine` isn't returning anything. Remove the `obj <-` or add `return(obj)`. — Limey, Jul 22 '20 at 09:11
Thought about that too, but even with this addition, it's not working. Only the RealWorldData is displayed. — Patrick Nit, Jul 22 '20 at 09:29
This illustrates why it's so helpful to provide a MWE so your code can be tested. Your problem is that you use local assignment to update the plot object in the `addGeomLine` function. You need a global assignment. This version of your function gives the desired result: `addGeomLine <- function (i) { plotLoop <<- plotLoop + geom_line(data = listOfSimResults[[force(i)]], mapping = aes(x = DateTime2, y = VALUE, color = paste0("simRun-", force(i)))) } lapply(1:5, addGeomLine)`. But I urge you to adopt the solution in my updated answer. — Limey, Jul 22 '20 at 09:39
