0

So I have the following for loop:

for(i in 1:dim(d)[1])
{
  if(d$countryname[i] %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador"))
   {next}
   else
   {d$countryname[i] <- "Others"}
}

The data frame d has more than 6.5 million rows, and d$countryname is a factor.

Is there a way to make this faster? It is very slow. Thank you.

Roland
intael

3 Answers

4

Work on the levels:

x <- factor(c("a", "a", "b", "b", "c", "d"))
levels(x)[levels(x) %in% c("b", "d")] <- "other"
x
#[1] a     a     other other c     other
#Levels: a other c

This should be fast since the string matching is done only on the levels (one entry per distinct value), not on every element of the vector. Of course, if you use package data.table you can be even faster.
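Applied to the question's data, the condition has to be negated, because there the listed countries are the ones to keep. A minimal sketch, assuming a made-up five-row data frame and a shortened keep list (both invented for illustration):

```r
# Toy stand-in for the real data; the values are invented for illustration
d <- data.frame(countryname = factor(c("Italy", "Germany", "Spain", "Japan", "Italy")))
keep <- c("Italy", "Spain")   # shortened version of the question's list

# Rename every level that is NOT in the keep list; duplicated levels are merged
levels(d$countryname)[!levels(d$countryname) %in% keep] <- "Others"

as.character(d$countryname)
#[1] "Italy"  "Others" "Spain"  "Others" "Italy"
levels(d$countryname)
#[1] "Others" "Italy"  "Spain"
```

The column stays a factor throughout, so no conversion back is needed afterwards.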

Benchmarks

set.seed(42)
test <- data.frame(abc = factor(sample(letters, 6.5e6, replace = TRUE)))
#function by user164385
g <- function(test) {
  test$log <- test$abc %in% c("a", "e", "i", "o", "u")
  test$abc <- ifelse(test$log, test$abc, "x")
  test
}

rol <- function(test) {
  levels(test$abc)[levels(test$abc) %in% c("a", "e", "i", "o", "u")] <- "other"
  test
}

library(microbenchmark)
microbenchmark(test1 <- data.table:::copy(test), 
               {test1 <- test; g(test1)}, 
               {test1 <- test; rol(test)}, times = 5, unit = "ms")
#Unit: milliseconds
#                                expr         min          lq        mean      median          uq         max neval cld
#    test1 <- data.table:::copy(test)    5.645598    5.848151    6.044557    5.915754    5.964407    6.848877     5 a  
#  {     test1 <- test     g(test1) } 1966.524342 1971.394814 1988.507992 1978.835983 1987.284023 2038.500796     5   c
# {     test1 <- test     rol(test) }  141.646732  152.205054  154.106125  155.589032  159.307184  161.782623     5  b 
Roland
2

for loops in R are notoriously slow when they modify a data frame one element at a time, but many built-in R functions are vectorized and will improve your performance. My favorite would be to use ifelse:

country_check <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")
# as.character() is needed: ifelse() on a factor would return its integer codes, not the labels
d$countryname <- factor(ifelse(country_check, as.character(d$countryname), "Others"))
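One caveat when a factor goes through ifelse(), sketched here on a made-up three-element factor: ifelse() drops the factor attributes of its yes argument, so a factor passed directly comes back as its integer codes. Converting with as.character() first keeps the labels. The variable names below are only for illustration.

```r
x <- factor(c("b", "a", "c"))      # invented data; levels are a, b, c
keep <- x %in% c("a", "b")

with_codes  <- ifelse(keep, x, "Others")                 # factor passed directly
with_labels <- ifelse(keep, as.character(x), "Others")   # converted to character first

with_codes
#[1] "2"      "1"      "Others"
with_labels
#[1] "b"      "a"      "Others"
```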

Testing this against looping:

test <- data.frame(abc = factor(sample(letters, 100000, replace = TRUE)))
g <- function() {
   test$log <- test$abc %in% c("a", "e", "i", "o", "u")
   test$abc <- ifelse(test$log, test$abc, "x")
}
f <- function() {
    for(i in 1:dim(test)[1]) {
        if(test$abc[i] %in% c("a", "e", "i", "o", "u"))
        {next}
        else
        {test$abc[i] <- "x"}
    }
}

> system.time(g())
   user  system elapsed 
   0.04    0.00    0.05 
> system.time(f())
   user  system elapsed 
  22.51    7.78   30.57 

This is a substantial improvement, though there are probably solutions out there that do even better. My fragile little computer can't handle the loop at more than about 100,000 rows in the data frame, so I can't give you decent benchmarks for a real-sized example.

Using built-in functions that do their work in compiled C code will generally get you much better performance than doing the same work element by element in R.

Empiromancer
  • You are mistaken about the performance of `apply` functions [[1](http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar)]. And this solution is relatively slow with 6.5 million elements. – Roland Feb 05 '16 at 15:53
  • @Roland regarding `apply`, thanks for the reference. Always good to keep learning new things. I've added some timing that compares `ifelse` to OP's `for` loop strategy, and demonstrates that it's a substantial improvement. I make no claims for it being the best solution out there, though. – Empiromancer Feb 05 '16 at 15:59
  • I've compared your solution to mine in my answer now. – Roland Feb 05 '16 at 16:02
  • @Roland Very elegant approach with base R functions; you've got my +1. – Empiromancer Feb 05 '16 at 16:54
1

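How about:

keep <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")

# countryname is a factor, so "Others" must exist as a level before it can be assigned (otherwise you get NA)
levels(d$countryname) <- c(levels(d$countryname), "Others")
d$countryname[!keep] <- "Others"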
David Heckmann