Using for loops in R can be notoriously slow, but there are any number of built-in R functions that will improve your performance. My favorite would be to use ifelse:
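country_check <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")
# as.character() guards against countryname being a factor, where ifelse() would return integer level codes
d$countryname <- factor(ifelse(country_check, as.character(d$countryname), "Others"))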
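One caveat worth checking when using ifelse this way: if the column is stored as a factor, ifelse works on the underlying integer codes rather than the labels. The small sketch below uses made-up data (the `country` and `keep` vectors are illustrative, not from the question) to show the difference that converting to character first makes.

```r
# Illustrative data (hypothetical; not from the question)
country <- factor(c("Italy", "Ghana", "Spain", "Japan"))
keep <- c("Italy", "Spain")

# On a factor, ifelse() hands back the underlying level codes as text
bad <- ifelse(country %in% keep, country, "Others")

# Converting to character first preserves the country names
good <- ifelse(country %in% keep, as.character(country), "Others")

print(bad)
print(good)
```

If the column is already a character vector, the as.character() call is harmless, so it seems a reasonable default.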
Testing this against looping:
test <- data.frame(abc = factor(sample(letters, 100000, replace = TRUE)))
g <- function() {
  test$log <- test$abc %in% c("a", "e", "i", "o", "u")
  # as.character() keeps the letters; ifelse() on a factor would return integer codes
  test$abc <- ifelse(test$log, as.character(test$abc), "x")
}
f <- function() {
  for (i in 1:dim(test)[1]) {
    if (test$abc[i] %in% c("a", "e", "i", "o", "u")) {
      next
    } else {
      test$abc[i] <- "x"
    }
  }
}
> system.time(g())
   user  system elapsed
   0.04    0.00    0.05
> system.time(f())
   user  system elapsed
  22.51    7.78   30.57
This is a substantial improvement (roughly 600 times faster in elapsed time, 0.05 s versus 30.57 s), though there are probably solutions out there that do even better. My fragile little computer can't handle the loop at more than about 100,000 rows in the data frame, so I can't give you decent benchmarks for a realistically sized example.
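As one candidate for a solution that might do better, here is a minimal sketch that relabels the factor's levels instead of testing every row. It assumes the column is (or can be converted to) a factor; `collapse_levels` is a hypothetical helper name, not a base R function, and I have not timed it against the ifelse version.

```r
# Hypothetical helper: collapse every level outside `keep` into one
# `other` level by editing the level labels rather than the rows.
collapse_levels <- function(x, keep, other = "Others") {
  x <- as.factor(x)
  lv <- levels(x)
  lv[!(lv %in% keep)] <- other
  levels(x) <- lv  # levels that now share a label are merged
  x
}

set.seed(1)
abc <- factor(sample(letters, 100000, replace = TRUE))
vowels <- c("a", "e", "i", "o", "u")
collapsed <- collapse_levels(abc, keep = vowels, other = "x")

# Cross-check against the ifelse() approach on character data
expected <- ifelse(abc %in% vowels, as.character(abc), "x")
stopifnot(identical(as.character(collapsed), expected))
```

The appeal is that the work scales with the number of distinct levels (26 here) rather than the number of rows, but whether that pays off in practice is something to measure on your own data.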
Using built-in functions that keep their guts hidden in C code will generally get you much better performance than doing all your hard work in R.