3

Could you help me optimize the code below? As you can see, I'm using the same date twice, once for graph generation and once for subset y generation. The result is correct, but I'd like some help trying to optimize to at least use the date only once and another optimizing that you find necessary. Every help is welcome.

Thank you very much!

library(dplyr)
library(lubridate)
library(tidyverse)

#dataset
df <- structure(
  list(date1 = c("2021-06-28","2021-06-28","2021-06-28","2021-06-28","2021-06-28",
                 "2021-06-28","2021-06-28","2021-06-28"),
       date2 = c("2021-04-02","2021-04-03","2021-04-08","2021-04-09","2021-04-10","2021-07-01","2021-07-02","2021-07-03"),
       Week= c("Friday","Saturday","Thursday","Friday","Saturday","Thursday","Friday","Monday"),
       DR01 = c(4,1,4,3,3,4,3,6), DR02= c(4,2,6,7,3,2,7,4),DR03= c(9,5,4,3,3,2,1,5),
       DR04 = c(5,4,3,3,6,2,1,9),DR05 = c(5,4,5,3,6,2,1,9),
       DR06 = c(2,4,3,3,5,6,7,8),DR07 = c(2,5,4,4,9,4,7,8)),
  class = "data.frame", row.names = c(NA, -8L))

#Generate graph

dmda<-"2021-07-01"
dta<-df

datas<-dta %>%
  filter(date2 == ymd(dmda)) %>%
  summarize(across(starts_with("DR"), sum)) %>%
  pivot_longer(everything(), names_pattern = "DR(.+)", values_to = "val") %>%
  mutate(name = as.numeric(name))
colnames(datas)<-c("Days","Numbers")
attach(datas)
plot(Numbers ~ Days, ylim=c(0,20))

model <- nls(Numbers ~ b1*Days^2+b2,start = list(b1 = 47,b2 = 0))

new.data <- data.frame(Days = seq(min(Days),max(Days),len = 45))
lines(new.data$Days,predict(model,newdata = new.data))


#Add the y points to the graph

df[, 1:2] = lapply(df[, 1:2], FUN = as_date)

get_cutoff = function(date) {
  date2 = as_date(date)
  date1 = df[1,1]
  as.numeric(date2 - date1 + 1)
}

subset_data = function(date, start_index) {
  date = as_date(date)
  if (date > df[1,1]) {
    end_index = start_index + get_cutoff(date) - 1
    df[, -c(start_index:end_index)] %>%
      filter(date2 == date)
  } else {
    return(df)
  }
} 

y<-subset_data("2021-07-01", 4)
y

pivot_longer(y, 
             cols=c(starts_with("DR"))) %>% 
  mutate(day = parse_number(name)) -> new_y
new_y

lines(x=new_y$day, y=new_y$value, col="red")
points(x=new_y$day, y=new_y$value, col="red")

enter image description here

  • 1
    [Don't use `attach`](https://stackoverflow.com/questions/10067680/why-is-it-not-advisable-to-use-attach-in-r-and-what-should-i-use-instead) – Ronak Shah Sep 10 '21 at 12:57
  • 1
    Towards what do you want to optimize? Speed? Readability? Is this really the bottleneck of your code? – danlooo Sep 10 '21 at 13:21
  • In fact, my optimization idea is to not need to use the same date twice, that is, I would only put the date in once. And if you think you could leave the code in a better order, that's also welcome. –  Sep 10 '21 at 13:30

2 Answers2

2

Make these changes:

  • only load packages used
  • can eliminate lubridate
  • don't need dta
  • in filter we don't need to convert dmda to Date class
  • pivot_wider can transform the names
  • don't use attach
  • the model is linear in the parameters so use lm, not nls
  • replace the new.data/lines with curve
  • don't overwrite df
  • simplify the cutoff calculation
  • use type = "o" to reduce points/lines to just lines
  • use subset in lines

Now assuming that df and dmda have been defined as in the question we have this.

library(dplyr)
library(tidyr)

datas <- df %>%
  filter(date2 == dmda) %>%
  summarize(across(starts_with("DR"), sum)) %>%
  pivot_longer(everything(), names_pattern = "DR(.+)", 
    names_to = "Days", values_to = "Numbers", 
    names_transform = list(Days = as.numeric))

plot(Numbers ~ Days, datas, ylim=c(0,20))

model <- lm(Numbers ~ I(Days^2), datas)
rng <- range(datas$Days)
curve(predict(model, list(Days = x)), rng[1], rng[2], add = TRUE)

# assume this for cutoff.  You may or may not need to change this line.
cutoff <- as.numeric(as.Date(dmda) - first(as.Date(df$date1))) + 1
lines(Numbers ~ Days, datas, subset = seq_len(nrow(datas)) > cutoff,
  type = "o" , col = "red")

screenshot

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • your flowchart is that I should have known or taught when i started learning and using R – PesKchan Sep 10 '21 at 14:17
  • Thanks so much Grothendieck, however it's not working when I run your code. Could you please test it again and also insert the dmda with the date in your code –  Sep 10 '21 at 15:22
  • Have fixed. Try it now. First copy df and dmda to a fresh session of R and then copy the code in the answer. – G. Grothendieck Sep 10 '21 at 15:31
1

I used ggplot rather than base R plotting functions since you are already working in the tidyverse. The following will do the trick to plot it all on a single graph.

dmda<-"2021-07-01"
dta<-df

## Rather than rely on column position, explicitly set the number
## of days desired for highlighting on plot
num_days <- 3

model <- nls(Numbers ~ b1*Days^2+b2,start = list(b1 = 47,b2 = 0))
new.data <- data.frame(Days = seq(min(Days),max(Days),len = 45)) %>%
    mutate(Numbers = predict(model, newdata = .))

datas<-dta %>%
    filter(date2 == ymd(dmda)) %>%
    summarize(across(starts_with("DR"), sum)) %>%
    ## Can convert data to numeric and create column names inside pivot_longer
    pivot_longer(everything(), names_pattern = "DR(.+)", 
                 values_to = "Numbers", names_to = "Days",
                 names_transform = list(Days = as.numeric, Numbers = as.numeric)) %>%
    ## Create flag for whether the values are in the final number of days
    mutate(subs = 1:n() > (n() - num_days))


plt <- ggplot(datas, aes(x = Days, y = Numbers)) +
    geom_point(aes(color = subs)) +
    geom_line(data = filter(datas, subs == TRUE), color = "red") +
    geom_line(data = new.data, color = "black") +
    scale_y_continuous(limits = c(0, 20)) +
    scale_color_manual(values = c("black", "red"))
plt

enter image description here

mikebader
  • 1,075
  • 3
  • 12