How to optimize code that generates graph in R

Question

Could you help me optimize the code below? As you can see, I'm using the same date twice, once for graph generation and once for subset y generation. The result is correct, but I'd like some help trying to optimize to at least use the date only once and another optimizing that you find necessary. Every help is welcome.

Thank you very much!

library(dplyr)
library(lubridate)
library(tidyverse)

#dataset
df <- structure(
  list(date1 = c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
                 "2021-06-28","2021-06-28","2021-06-28"),
       date2 = c("2021-04-02","2021-04-03","2021-04-08","2021-04-09","2021-04-10","2021-07-01","2021-07-02","2021-07-03"),
       Week= c("Friday","Saturday","Thursday","Friday","Saturday","Thursday","Friday","Monday"),
       DR01 = c(4,1,4,3,3,4,3,6), DR02= c(4,2,6,7,3,2,7,4),DR03= c(9,5,4,3,3,2,1,5),
       DR04 = c(5,4,3,3,6,2,1,9),DR05 = c(5,4,5,3,6,2,1,9),
       DR06 = c(2,4,3,3,5,6,7,8),DR07 = c(2,5,4,4,9,4,7,8)),
  class = "data.frame", row.names = c(NA, -8L))

#Generate graph

dmda<-"2021-07-01"
dta<-df

datas<-dta %>%
  filter(date2 == ymd(dmda)) %>%
  summarize(across(starts_with("DR"), sum)) %>%
  pivot_longer(everything(), names_pattern = "DR(.+)", values_to = "val") %>%
  mutate(name = as.numeric(name))
colnames(datas)<-c("Days","Numbers")
attach(datas)
plot(Numbers ~ Days, ylim=c(0,20))

model <- nls(Numbers ~ b1*Days^2+b2,start = list(b1 = 47,b2 = 0))

new.data <- data.frame(Days = seq(min(Days),max(Days),len = 45))
lines(new.data$Days,predict(model,newdata = new.data))


#Add the y points to the graph

df[, 1:2] = lapply(df[, 1:2], FUN = as_date)

get_cutoff = function(date) {
  date2 = as_date(date)
  date1 = df[1,1]
  as.numeric(date2 - date1 + 1)
}

subset_data = function(date, start_index) {
  date = as_date(date)
  if (date > df[1,1]) {
    end_index = start_index + get_cutoff(date) - 1
    df[, -c(start_index:end_index)] %>%
      filter(date2 == date)
  } else {
    return(df)
  }
} 

y<-subset_data("2021-07-01", 4)
y

pivot_longer(y, 
             cols=c(starts_with("DR"))) %>% 
  mutate(day = parse_number(name)) -> new_y
new_y

lines(x=new_y$day, y=new_y$value, col="red")
points(x=new_y$day, y=new_y$value, col="red")

[Don't use `attach`](https://stackoverflow.com/questions/10067680/why-is-it-not-advisable-to-use-attach-in-r-and-what-should-i-use-instead) — Ronak Shah, Sep 10 '21 at 12:57
Towards what do you want to optimize? Speed? Readability? Is this really the bottleneck of your code? — danlooo, Sep 10 '21 at 13:21
In fact, my optimization idea is to not need to use the same date twice, that is, I would only put the date in once. And if you think you could leave the code in a better order, that's also welcome. — , Sep 10 '21 at 13:30

G. Grothendieck · Accepted Answer · 2021-09-14T03:12:08.550

Make these changes:

only load packages used
can eliminate lubridate
don't need dta
in filter we don't need to convert dmda to Date class
pivot_wider can transform the names
don't use attach
the model is linear in the parameters so use lm, not nls
replace the new.data/lines with curve
don't overwrite df
simplify the cutoff calculation
use type = "o" to reduce points/lines to just lines
use subset in lines

Now assuming that df and dmda have been defined as in the question we have this.

library(dplyr)
library(tidyr)

datas <- df %>%
  filter(date2 == dmda) %>%
  summarize(across(starts_with("DR"), sum)) %>%
  pivot_longer(everything(), names_pattern = "DR(.+)", 
    names_to = "Days", values_to = "Numbers", 
    names_transform = list(Days = as.numeric))

plot(Numbers ~ Days, datas, ylim=c(0,20))

model <- lm(Numbers ~ I(Days^2), datas)
rng <- range(datas$Days)
curve(predict(model, list(Days = x)), rng[1], rng[2], add = TRUE)

# assume this for cutoff.  You may or may not need to change this line.
cutoff <- as.numeric(as.Date(dmda) - first(as.Date(df$date1))) + 1
lines(Numbers ~ Days, datas, subset = seq_len(nrow(datas)) > cutoff,
  type = "o" , col = "red")

your flowchart is that I should have known or taught when i started learning and using R — PesKchan, Sep 10 '21 at 14:17
Thanks so much Grothendieck, however it's not working when I run your code. Could you please test it again and also insert the dmda with the date in your code — , Sep 10 '21 at 15:22
Have fixed. Try it now. First copy df and dmda to a fresh session of R and then copy the code in the answer. — G. Grothendieck, Sep 10 '21 at 15:31

score 1 · Answer 2 · answered Sep 10 '21 at 13:52

I used ggplot rather than base R plotting functions since you are already working in the tidyverse. The following will do the trick to plot it all on a single graph.

dmda<-"2021-07-01"
dta<-df

## Rather than rely on column position, explicitly set the number
## of days desired for highlighting on plot
num_days <- 3

model <- nls(Numbers ~ b1*Days^2+b2,start = list(b1 = 47,b2 = 0))
new.data <- data.frame(Days = seq(min(Days),max(Days),len = 45)) %>%
    mutate(Numbers = predict(model, newdata = .))

datas<-dta %>%
    filter(date2 == ymd(dmda)) %>%
    summarize(across(starts_with("DR"), sum)) %>%
    ## Can convert data to numeric and create column names inside pivot_longer
    pivot_longer(everything(), names_pattern = "DR(.+)", 
                 values_to = "Numbers", names_to = "Days",
                 names_transform = list(Days = as.numeric, Numbers = as.numeric)) %>%
    ## Create flag for whether the values are in the final number of days
    mutate(subs = 1:n() > (n() - num_days))


plt <- ggplot(datas, aes(x = Days, y = Numbers)) +
    geom_point(aes(color = subs)) +
    geom_line(data = filter(datas, subs == TRUE), color = "red") +
    geom_line(data = new.data, color = "black") +
    scale_y_continuous(limits = c(0, 20)) +
    scale_color_manual(values = c("black", "red"))
plt

How to optimize code that generates graph in R

2 Answers2