I'm very new to R, but I have a relatively simple `mutate()` call which I'm running on a large data frame (4,000,000+ rows x 150 columns), and it currently takes 15+ hours to run.
The expression uses a condition which only looks at a couple of columns, a numeric variable and a date (NB: some of the dates are NA), and returns a binary flag. It works fine on a small example but seems very inefficient on anything past 1M rows.
I'm using R v3.5, RStudio 1.2.1335, and dplyr 0.8.0.1.
```r
library(dplyr)

## create dummy data frame
df <- data.frame(diff = c(35, 432, -278, 6556, 90, 10, 76, -24, 6),
                 date_a = as.Date(c("2017-03-21", "2017-01-08", NA, "2015-12-01",
                                    "2019-03-09", NA, "2018-09-06", "2017-06-01",
                                    "2018-05-30")))

df <- df %>%
  mutate(flag = if_else(diff > 21 | is.na(date_a), 1, 0))
```
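For reference, the same flag can be computed without dplyr at all, since both `>` and `is.na()` are vectorized in base R. This is only a sketch of an equivalent expression (the column names match my dummy data above), useful for checking whether the dplyr layer is the bottleneck:

```r
## Equivalent flag using base R only.
## The comparison and is.na() are vectorized, so this is a single
## pass over the two columns; as.integer() maps TRUE/FALSE to 1/0.
df$flag <- as.integer(df$diff > 21 | is.na(df$date_a))
```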
When I chunk the data down to a smaller size, the process takes:
- 20k rows = 1.9 secs
- 200k rows = 1.7 mins
- 500k rows = 10 mins
- 1m rows = 34 mins
And anything bigger than that takes far longer than I'm prepared to wait.
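Those timings scale much worse than linearly, which makes me suspect the cost is not the comparison itself. One way I've thought of to narrow it down (this is my own diagnostic sketch, with made-up column values, not timings from the real data) is to run just the vectorized flag calculation on a synthetic data frame of the full size:

```r
## Benchmark the flag calculation alone on ~4M synthetic rows.
set.seed(1)
n <- 4e6
test_df <- data.frame(diff   = sample(-500:7000, n, replace = TRUE),
                      date_a = as.Date("2017-01-01") +
                               sample(c(NA, 0:1000), n, replace = TRUE))

## Time only the vectorized comparison, nothing else.
system.time(
  flag <- as.integer(test_df$diff > 21 | is.na(test_df$date_a))
)
```

If this finishes quickly, the slowdown presumably comes from something around the `mutate()` rather than the condition itself.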