I have some rather simple R code that takes 10 to 20 minutes to execute, which I believe is unnecessarily time-consuming. The data consist of a data frame with approximately 30 columns and 500,000 rows. The aim of the loop is to determine which bin each value should be put in.
After reading some other threads on the topic, I have tried to improve the code by adding the entire column before the loop and doing some calculations outside the loop, but none of these methods have improved the run time significantly.
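library(dplyr)  # assumed source of between(); it is not part of base R

# Age of every date in days, computed once for the whole column
# (the loop below does not use it yet and recomputes the age per row)
col_days <- Sys.Date() - as.Date(df$col)

i <- 1
# Use <= so that the last row is processed as well
while (i <= length(df$col)) {
  if (Sys.Date() - as.Date(df$col[i]) < 366) {
    df$col_bin[i] <- "Less than 1 year"
    i <- i + 1
  } else if (between(Sys.Date() - as.Date(df$col[i]), 366, 1095)) {
    df$col_bin[i] <- "1 year to 3 years"
    i <- i + 1
  } else if (between(Sys.Date() - as.Date(df$col[i]), 1096, 1825)) {
    df$col_bin[i] <- "3 years to 5 years"
    i <- i + 1
  } else if (between(Sys.Date() - as.Date(df$col[i]), 1826, 3650)) {
    df$col_bin[i] <- "5 years to 10 years"
    i <- i + 1
  } else {
    df$col_bin[i] <- "More than 10 years"
    i <- i + 1
  }
}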
With this version of the code, it takes approximately 15 minutes to process all rows. I believe there are several ways to improve this. Any suggestions?
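For reference, one direction that might avoid the row-by-row loop is to bin the whole column in a single vectorized call. The following is only a minimal sketch using base R's `cut()`, not a tested replacement. The helper name `bin_age`, its `today` argument, and the toy dates are hypothetical, and it assumes `df$col` holds values that `as.Date()` can parse. The breaks are chosen so that the right-closed intervals match the whole-day thresholds of the loop (365 days or fewer, 366 to 1095, 1096 to 1825, 1826 to 3650, and above 3650).

```r
# Hypothetical helper: bin_age and its arguments are illustrative names,
# not part of the original code.
bin_age <- function(dates, today = Sys.Date()) {
  # Age in days, computed once for the entire vector
  days <- as.numeric(today - as.Date(dates))

  # cut() uses right-closed intervals by default:
  # (-Inf, 365], (365, 1095], (1095, 1825], (1825, 3650], (3650, Inf]
  bins <- cut(
    days,
    breaks = c(-Inf, 365, 1095, 1825, 3650, Inf),
    labels = c("Less than 1 year", "1 year to 3 years",
               "3 years to 5 years", "5 years to 10 years",
               "More than 10 years")
  )

  # Return plain character labels, as the loop does
  as.character(bins)
}

# Toy data with a fixed reference date so the result is reproducible
toy_today <- as.Date("2024-01-01")
toy_dates <- c("2023-06-01", "2022-01-01", "2020-01-01",
               "2016-01-01", "2010-01-01")
toy_bins <- bin_age(toy_dates, today = toy_today)
print(toy_bins)
```

On the real data this would presumably be used as `df$col_bin <- bin_age(df$col)`. One difference from the loop: a missing date yields `NA` here, whereas the `if` condition in the loop would stop with an error.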