1

My question: I was debugging some code at work, running it block-by-block, when I realized a small block was taking an unusual amount of time. I killed it and made a minor (but logically equivalent) tweak, and it ran almost instantly. I would like to understand why. The following code is in R, however, I imagine the answer may not be specific to R, and may apply to most programming languages of a similar paradigm or 'method-of-compiling'?

The code & information:
Using R version 3.6.1
Libraries loaded: dplyr, knitr, DataExplorer, glue, zoo

old_df is data frame of 5653380 obs. of 91 variables.
field1 is a col of policy numbers with class "character". Not unique, each occurs many times.
date_col1 and date_col2 are columns with class "date".

Method 1:

    new_df <- old_df %>%
      group_by(field1) %>%
      mutate(checkfield = date_col1 - date_col2) %>%
      filter(checkfield < 0) %>%
      filter(row_number() == 1)
    old_df$filter <- ifelse(old_df$field1 %in% new_df$field1,1,0)

Method 2:

    new_df <- old_df %>%
      group_by(field1) %>%
      filter(date_col1 < date_col2) %>%
      filter(row_number() == 1)
    old_df$filter <- ifelse(old_df$field1 %in% new_df$field1,1, 0)

As you can probably see, the intended output of both methods is to add a flag, "1", in the column "filter" for policy numbers where date_col1 < date_col2. I did not write method 1, and my goal in writing method 2 was to change it as little as possible while also making it faster, so please avoid spending too much time talking about problems with method 1 that are not related to why it is unbearably slower than method 2. Feel free to mention such things, but I would like the crux to be why method 1 was taking 20, 30 minutes etc. For example, I believe in method 1, the first filter call could be above the group_by call. This might increase speed by an unnoticeable amount. I am not too concerned about this.

My thoughts: Clearly method 2 might be a little faster because it avoids making the column "checkfield", but I dont think this is the issue, as I ran method 1 line by line, and it appears to be the line 'filter(checkfield < 0)' where things went awry. For testing, I defined two dates x,y and checked class(x-y) which returned "difftime". So in this filter call, we are comparing "difftime" to a "numeric". Perhaps this requires some type of type-juggling to make the comparison, where as method 2 compares a date object to a date object?

Let me know what you think! I am very curious about this.

Prince M
  • 395
  • 2
  • 13
  • can you benchmark/give us a timing comparison on `filter(date_col1 < date_col2)` vs. `filter(checkfield < 0)` to support your conjecture? (See e.g. the `microbenchmark` package ...) – Ben Bolker Aug 20 '19 at 22:20
  • @BenBolker. Yes! I don't know about that, but I will read up on the benchmark documentation. – Prince M Aug 20 '19 at 22:22
  • I actually prefer the `rbenchmark` package to `microbenchmark`, but `microbenchmark` seems to be more popular. They will give roughly equivalent results. – Ben Bolker Aug 20 '19 at 22:23
  • 1
    by the way, I think `%>% head(1)` might be a *lot* faster than `%>% filter(row_number() == 1)` ... and you can probably do the last step with an appropriate join ... – Ben Bolker Aug 20 '19 at 22:42
  • Thank you for the info Ben. I will remove the first two block of text from the question and look into the benchmarking. I am still at work now, so it will probably be this evening that I return and edit with the benchmarking. I also like the advice about head(1). – Prince M Aug 20 '19 at 22:47

2 Answers2

1

I believe most of the increased time is due to creating the new column. As you can see, M1 and M3 have similar times. Of course that difference of ~2 milliseconds between M1 and M3 will multiply based on the data size

library(tidyverse)
library(microbenchmark)

set.seed(42)
n = 1e5
d = seq.Date(Sys.Date() - 10000, Sys.Date(), 1)
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y, id = sample(LETTERS, n, TRUE))

microbenchmark(M1 = {
    df1 %>%
        group_by(id) %>%
        mutate(chk = x - y) %>%
        filter(chk < 0) %>%
        filter(row_number() == 1)
},
M2 = {
    df1 %>%
        group_by(id) %>%
        filter(x < y) %>%
        filter(row_number() == 1)
},
M3 = {
    df1 %>%
        group_by(id) %>%
        mutate(chk = x - y) %>%
        filter(x < y) %>%
        filter(row_number() == 1)
})
#Unit: milliseconds
# expr       min        lq      mean    median       uq       max neval
#   M1 13.130673 13.405151 15.088266 14.096772 15.56080 22.636533   100
#   M2  5.931192  6.208457  6.564363  6.402879  6.71973  9.354252   100
#   M3 11.360640 11.607993 12.449220 12.001383 12.57732 18.260131   100

For the point about comparing difftime to numeric, there doesn't seem to be a lot of difference

library(microbenchmark)

set.seed(42)
n = 1e7
x = sample(d, n, TRUE)
y = sample(d, n, TRUE)
df1 = data.frame(x, y)
df1$difference = df1$x - df1$y

class(df1$difference)
#[1] "difftime"

microbenchmark(date_vs_date = {
    df1 %>% filter(x < y)
},
date_vs_numeric ={
    df1 %>% filter(difference < 0)
})
#Unit: milliseconds
#            expr      min       lq     mean   median       uq      max neval
#    date_vs_date 177.1789 222.4112 243.6617 233.7221 244.2765 430.4273   100
# date_vs_numeric 181.6222 217.1121 251.6127 232.7213 249.8218 455.8285   100
d.b
  • 32,245
  • 6
  • 36
  • 77
1

My explorations so far, with a simplified example and a slightly smaller data set (only a million rows and a minimal subset of columns) have the individual tests (test_cf for filtering on the checkfield variable, test_lt for filtering on the date comparison) taking about the same time, which both take about the same time as creating the checkfield column. Doing both at once (comb, creating and comparing) takes 2.5 x longer, not sure why.

Perhaps you can use this as a starting point for bisecting/benchmarking to find the culprit.

     test elapsed relative
2    comb   5.557    2.860
1 make_cf   1.943    1.000
4 test_cf   2.122    1.092
3 test_lt   2.109    1.085

I used rbenchmark::benchmark() because I prefer the output format: microbenchmark::microbenchmark() might be more accurate (but I would be surprised if it made a big difference).

code

library(dplyr)
n <- 1e6 ## 5653380 in orig; reduce size for laziness
set.seed(101)
## sample random dates, following
##  https://stackoverflow.com/questions/21502332/generating-random-dates
f <- function(n)
    sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"),
           replace=TRUE,
           size=n)
dd <- tibble(
    date_col1=f(n),
    date_col2=f(n)
 ## set up checkfield so we can use it without creating it
) %>% mutate(cf=date_col1-date_col2)

Benchmark:

library(rbenchmark)
benchmark(
    make_cf=dd %>% mutate(checkfield=date_col1-date_col2),
    comb=dd %>% mutate(checkfield=date_col1-date_col2) %>% filter(checkfield<0),
    test_lt=dd %>% filter(date_col1<date_col2),
    test_cf=dd %>% filter(cf<0),
    columns=c("test","elapsed","relative")
)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453