I have some rather simple R code that takes 10-20 minutes to execute, which I believe is unnecessarily time-consuming. The data consist of a data frame with approximately 30 columns and 500,000 rows. The aim of the loop is to determine which bin a certain value should be put in.

I have tried to improve the code by computing the entire column of date differences before the loop and by moving some calculations outside the loop, after reading other threads on the topic, but none of these changes improved the runtime significantly.

library(dplyr)  # for between()

# Precomputed day differences (the attempted optimisation mentioned above;
# note that the loop below still recomputes the difference for every row)
col_days <- Sys.Date() - as.Date(df$col)

i <- 1
while (i <= length(df$col)) {  # the original `i < length(...)` skipped the last row
  if (Sys.Date() - as.Date(df$col[i]) < 366) {
    df$col_bin[i] <- "Less than 1 year"
  } else if (between(Sys.Date() - as.Date(df$col[i]), 366, 1095)) {
    df$col_bin[i] <- "1 year to 3 years"
  } else if (between(Sys.Date() - as.Date(df$col[i]), 1096, 1825)) {
    df$col_bin[i] <- "3 years to 5 years"
  } else if (between(Sys.Date() - as.Date(df$col[i]), 1826, 3650)) {
    df$col_bin[i] <- "5 years to 10 years"
  } else {
    df$col_bin[i] <- "More than 10 years"
  }
  i <- i + 1
}

With this version of the code it takes approximately 15 minutes to process all rows. I believe there are several ways to improve this. Suggestions?

Raz89
  • Can you share a small example of your data? And expected output? From the looks of it, the function `cut()` may be relevant to this problem. – RLave Feb 08 '19 at 08:28
  • Yes: replace all of that with `df$col_bin <- cut(df$col, c(0, 366, 1069, 1826, 3651, Inf), labels = c("<1", "1-3", "3-5", "5-10", ">10"))` (using whatever labels you want). No loop. (Mistype: you probably want `cut(Sys.Date() - as.Date(df$col), ...)`; a corrected sketch follows these comments.) – r2evans Feb 08 '19 at 08:29
  • Why are you using loops for this? These look to me to be pretty easy to do with simple functions, either base R or the tidyverse; I don't even think you necessarily need `*apply`. You should definitely look at `case_when`. Also, why not just calculate the difference, do division and rounding, then make it an ordered factor with the levels you want? And what about leap years? Why not just use date math? – Elin Feb 08 '19 at 08:32
  • Thanks for all the feedback! The example solution presented by r2evans works fantastic. It now only takes a second instead of 15 minutes. Really, really helpful. Thank you all, I realise that I focused too much on using a loop instead of this kind of idea. Wonderful! – Raz89 Feb 08 '19 at 09:14
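
For reference, a minimal sketch of the `cut()` approach suggested in the comments, with break points aligned to the bins in the question's loop (the labels are placeholders, adjust as needed):

# Vectorised binning with cut(); right = FALSE makes each interval [a, b),
# so that e.g. a difference of 366 days lands in "1 year to 3 years",
# matching the between() bounds in the loop above
df$col_bin <- cut(
  as.numeric(Sys.Date() - as.Date(df$col), units = "days"),
  breaks = c(-Inf, 366, 1096, 1826, 3651, Inf),
  labels = c("Less than 1 year", "1 year to 3 years", "3 years to 5 years",
             "5 years to 10 years", "More than 10 years"),
  right = FALSE
)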

2 Answers


Here is a solution using `dplyr::case_when()` (which I find more tractable than `base::cut()`):

library(dplyr)
df %>% 
  mutate(
    col_bin = case_when(
      days < 366 ~ "Less than 1 year",
      days < 1095 ~ "1 year to 3 years",
      days < 1825 ~ "3 years to 5 years",
      days < 3650 ~ "5 years to 10 years",
      TRUE ~ "More than 10 years"
    )
  )

          col      days             col_bin
1  2012-02-27 2538 days 5 years to 10 years
2  2014-11-27 1534 days  3 years to 5 years
3  2013-04-06 2134 days 5 years to 10 years
4  2009-08-15 3464 days 5 years to 10 years
5  2017-12-09  426 days   1 year to 3 years
6  2016-01-08 1127 days  3 years to 5 years
7  2015-05-08 1372 days  3 years to 5 years
8  2015-05-20 1360 days  3 years to 5 years
9  2010-09-08 3075 days 5 years to 10 years
10 2013-03-26 2145 days 5 years to 10 years
11 2010-03-15 3252 days 5 years to 10 years
12 2011-05-08 2833 days 5 years to 10 years
13 2017-07-21  567 days   1 year to 3 years
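
Note that `case_when()` returns a plain character vector. If you want the bins to sort in their natural order (as one of the comments suggests), a small follow-up step turns the result into an ordered factor:

# Optional follow-up: make col_bin an ordered factor so sorting
# and plotting respect the intended bin order
bin_levels <- c("Less than 1 year", "1 year to 3 years", "3 years to 5 years",
                "5 years to 10 years", "More than 10 years")
df$col_bin <- factor(df$col_bin, levels = bin_levels, ordered = TRUE)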

Example data:

set.seed(10)
df <- data.frame(
  col = Sys.Date() - sample(1:5000, size = 13)
)
df[["days"]] <- Sys.Date() - as.Date(df[["col"]])
s_baldur
  • This is a great example of how to handle the problem in a much easier way, thank you! I will look at both this example and the example involving cut and try to find out which one suits me the best. Thanks again for fantastic feedback! – Raz89 Feb 08 '19 at 09:16

Here is a comparison of four solutions, using either dplyr or data.table combined with either case_when() or cut().

Thanks to snoram for the example data and the dplyr and case_when parts.

In this test dplyr and data.table perform about equally well, but cut() is faster than case_when(). Compared to your original solution, all of these should be far faster, and probably fast enough in absolute terms for a dataset of your size.

require(data.table)
require(dplyr)
require(microbenchmark)
require(ggplot2)

set.seed(10)
df <- data.frame(
  col = Sys.Date() - sample(1:5000, size = 13)
)
df[["days"]] <- Sys.Date() - as.Date(df[["col"]])


benchmark <- microbenchmark(
  data.table={
    dt <- data.table(df)
    dt[, col_bin := cut(
      as.numeric(days, units="days"), 
      breaks=c(-Inf, 366, 1095, 1825, 3650, Inf), 
      labels=c(
        "Less than 1 year",
        "1 year to 3 years",
        "3 years to 5 years",
        "5 years to 10 years",
        "More than 10 years"
      ))]
  },
  dplyr={
    res <- df %>% 
      mutate(
        col_bin = case_when(
          days < 366 ~ "Less than 1 year",
          days < 1095 ~ "1 year to 3 years",
          days < 1825 ~ "3 years to 5 years",
          days < 3650 ~ "5 years to 10 years",
          TRUE ~ "More than 10 years"
        )
      )
  },
  `data.table & case_when`={
    dt <- data.table(df)
    dt[, col_bin := case_when(
          days < 366 ~ "Less than 1 year",
          days < 1095 ~ "1 year to 3 years",
          days < 1825 ~ "3 years to 5 years",
          days < 3650 ~ "5 years to 10 years",
          TRUE ~ "More than 10 years"
        )]
  },
  `dplyr & cut`={
    res <- df %>% 
      mutate(
        col_bin = cut(
          as.numeric(days, units="days"), 
          breaks=c(-Inf, 366, 1095, 1825, 3650, Inf), 
          labels=c(
            "Less than 1 year",
            "1 year to 3 years",
            "3 years to 5 years",
            "5 years to 10 years",
            "More than 10 years"
          )
        )
      )
  }
)

autoplot(benchmark)

[Plot: timing of the four different solutions]
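
A caveat: the example data has only 13 rows, so these timings mostly measure fixed per-call overhead. To benchmark at a scale closer to the question (about 500,000 rows), you could generate a larger sample along these lines (the size here is illustrative) and rerun the same microbenchmark() call:

# Larger sample for a more realistic benchmark (~500,000 rows)
set.seed(10)
df <- data.frame(
  col = Sys.Date() - sample(1:5000, size = 500000, replace = TRUE)
)
df[["days"]] <- Sys.Date() - as.Date(df[["col"]])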

snaut
  • Would be interesting to see a benchmark of a third option where you use `data.table` with `case_when`. And much bigger data. – s_baldur Feb 08 '19 at 09:15
  • btw. `data.table` is hard to read at first, but after a few weeks I think it gets more readable than `dplyr` because of its concision. – s_baldur Feb 08 '19 at 09:17
  • Instead of `dt <- data.table(df)` you could do `setDT(df)`. – s_baldur Feb 08 '19 at 09:17
  • Hmm, ok, I'll edit this answer completely; it seems dplyr is not the slow thing here but case_when. Updated version incoming. – snaut Feb 08 '19 at 09:18
  • I used `dt <- data.table(df)` because I want to include the conversion to `data.table` in every evaluation. As far as I recall, `setDT` works in place, so the overhead of the conversion would not be measured. Depending on the context, either could be the right way: if you choose one data structure for the whole analysis and only convert at the beginning, not measuring the overhead would be the right approach. – snaut Feb 08 '19 at 09:29
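
As a small illustration of the `setDT()` versus `data.table()` point discussed in the last comment (a minimal sketch with a dummy one-column data frame):

library(data.table)

df1 <- data.frame(x = 1:3)
dt  <- data.table(df1)   # allocates a new data.table; df1 is unchanged
setDT(df1)               # converts df1 itself by reference, no copy made
class(df1)               # "data.table" "data.frame"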