My question: I was debugging some code at work, running it block by block, when I noticed that one small block was taking an unusually long time. I killed it, made a minor (but logically equivalent) tweak, and it ran almost instantly. I would like to understand why. The following code is in R, but I imagine the answer may not be specific to R and may apply to most programming languages with a similar paradigm or 'method of compiling'.
The code & information:
Using R version 3.6.1
Libraries loaded: dplyr, knitr, DataExplorer, glue, zoo
old_df is a data frame of 5653380 obs. of 91 variables.
field1 is a col of policy numbers with class "character". Not unique, each occurs many times.
date_col1 and date_col2 are columns with class "Date".
Method 1:
new_df <- old_df %>%
  group_by(field1) %>%
  mutate(checkfield = date_col1 - date_col2) %>%
  filter(checkfield < 0) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
Method 2:
new_df <- old_df %>%
  group_by(field1) %>%
  filter(date_col1 < date_col2) %>%
  filter(row_number() == 1)
old_df$filter <- ifelse(old_df$field1 %in% new_df$field1, 1, 0)
As you can probably see, the intended output of both methods is to add a flag, 1, in the column "filter" for every policy number that has at least one row where date_col1 < date_col2. I did not write method 1, and my goal in writing method 2 was to change it as little as possible while also making it faster, so please avoid spending too much time on problems with method 1 that are unrelated to why it is so much slower than method 2. Feel free to mention such things, but I would like the crux to be why method 1 was taking 20 or 30 minutes, or longer. For example, I believe that in method 1 the first filter call could be moved above the group_by call. This might improve speed by an unnoticeable amount, and I am not too concerned about it.
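To make the intended output concrete, here is a minimal base-R sketch on made-up data. The data frame toy_df, its values, and the helper vector flagged are hypothetical and only for illustration; this is not how either method above is written, just the result I expect from both.

```r
# Toy stand-in for old_df (made-up values, same column names as in the question)
toy_df <- data.frame(
  field1    = c("A", "A", "B", "B", "C"),
  date_col1 = as.Date(c("2020-01-01", "2020-06-01", "2020-01-01",
                        "2020-02-01", "2020-05-01")),
  date_col2 = as.Date(c("2020-03-01", "2020-01-01", "2019-01-01",
                        "2019-02-01", "2020-05-01")),
  stringsAsFactors = FALSE
)

# Policy numbers with at least one row where date_col1 is earlier than date_col2.
# which() drops rows where the comparison is NA.
flagged <- unique(toy_df$field1[which(toy_df$date_col1 < toy_df$date_col2)])

# Flag every row belonging to one of those policy numbers
toy_df$filter <- ifelse(toy_df$field1 %in% flagged, 1, 0)

print(toy_df)
```

Only policy "A" has a row with date_col1 before date_col2, so both of its rows should be flagged and all other rows should get 0.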
My thoughts: Method 2 is presumably a little faster because it avoids creating the column "checkfield", but I don't think this is the issue: I ran method 1 line by line, and it appears to be the line 'filter(checkfield < 0)' where things went awry. For testing, I defined two dates x and y and checked class(x - y), which returned "difftime". So in this filter call, we are comparing a "difftime" to a "numeric". Perhaps this requires some kind of type juggling to make the comparison, whereas method 2 compares a Date object to a Date object?
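For reference, the class check I describe looks roughly like this in base R. The two dates are arbitrary values chosen for illustration, and the last two lines are just the comparisons I would consider as alternatives, not something I have timed on the real data.

```r
# Two arbitrary Date values (illustrative only)
x <- as.Date("2020-01-01")
y <- as.Date("2020-03-01")

# Subtracting two Dates gives a difftime, not a plain number
d <- x - y
print(class(d))   # "difftime"
print(units(d))   # "days"

# What method 1 does: compare a difftime with a numeric
print(d < 0)

# Alternatives: strip the class first, or compare the Dates directly (method 2)
print(as.numeric(d) < 0)
print(x < y)
```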
Let me know what you think! I am very curious about this.