
I did search for efficient ways to perform iterative calculations in R, but probably not thoroughly enough, so I am conceding with this question. I am surprised that a simple iterative calculation takes so long in R compared to SAS, which I worked with primarily until recently.

Here is my code to calculate rnk2 based on 4 different variables in a data frame.

new_rank2 <- function(rnk, penalty, min_rnk, max_rnk){
    rnk2 = max(min(rnk + penalty, max_rnk), min_rnk)
    return(rnk2)
}
step4b <- step4[1:15000,]
for (i in 1:nrow(step4b)) {
    step4b$rnk2[i] <- new_rank2(step4b$rnk[i], step4b$penalty[i],
                                step4b$min_rnk[i], step4b$max_rnk[i])
}

With this code, it takes about 32 seconds for 10k records, 75 seconds for 15k, and 120 seconds for 20k records; I have about 400k records in total.

Another case where I need help is iterative conditional processing:

for (i in 1:nrow(data)) {
    if (data$V1[i] %in% c("A","B","E")) data$V3[i] = data$V4[i]
    if (data$V5[i] == "MED") data$V3[i] == 'XL'
}
  • The first should be easily vectorized using pmax() and pmin(). – joran Aug 06 '14 at 04:16
  • I don't see any test cases. Why not present `dput(head(testcase))`? Naming your data as 'data' just makes people think less of you. It's as though you named your iteration variable 'for' in a for-loop. – IRTFM Aug 06 '14 at 04:17
  • 1
    Possibly `step4b$rnk2 <- with(step4b, pmax(pmin(rnk+penalty, max_rnk), min_rnk) )` ? – thelatemail Aug 06 '14 at 04:23
  • @joran and @thelatemail, thanks for the tips; I tried and failed vectorization using just max() and min(), and I was not aware of pmin() and pmax(). I will certainly give that a try. – speedchase Aug 06 '14 at 04:37
  • @BondedDust, I am not following your response about test cases; if you are referring to sample data, it is fairly straightforward since I would like to change the value of V3 based on certain conditions. Like I mentioned in my OP, this is fairly simple & fast in SAS but not so in R. I had 'data' as 'step5' in my code and I thought that would make people think less of me, hence I generalized it. – speedchase Aug 06 '14 at 04:41
  • @BondedDust was asking you to provide a sample dataset so that we could test methods using [reproducible methods](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). This is a common issue on SO, and frankly it is frustrating to try to help without it. This is not the same as clearly stating your intent or algorithm (which you have done here), though these are also part of a good MWE (minimum working example). It is not safe to assume that those of us trying to help you will be able/willing to produce a dataset that meets all your needs. – r2evans Aug 06 '14 at 05:51

1 Answer


I generated a sample dataset to try some benchmarking for comparisons. (If it doesn't adequately represent your actual data, please let me know and I'll give it another run. I'm not certain it matters, since pmin and pmax tend to work equally well with numbers in the thousands and thousandths[1].)

new_rank2 <- function(rnk,penalty,min_rnk,max_rnk){
    rnk2 = max(min(rnk + penalty, max_rnk), min_rnk)
    return(rnk2)
}

set.seed(1)
n <- 20000
step4 <- data.frame(rnk = runif(n), penalty = runif(n),
                    min_rnk = runif(n), max_rnk = runif(n))
step4b <- step4c <- step4d <- step4

Basic performance of three methods. First, your iterative method:

system.time(
    for(i in 1:nrow(step4b)){
        step4b$rnk2[i] <- new_rank2(step4b$rnk[i], step4b$penalty[i],
                                    step4b$min_rnk[i], step4b$max_rnk[i])
    }
)
##     user  system elapsed 
##     3.40    0.00    3.41 

Second, a vectorized method:

system.time(
    step4c$rnk2 <- with(step4, pmax(pmin(rnk + penalty, max_rnk), min_rnk))
)
##     user  system elapsed 
##     0.02    0.00    0.02 

Third, a method utilizing Hadley Wickham's dplyr:

library(dplyr)
system.time(
    step4d <- step4 %>%
        mutate(rnk2 = pmax(pmin(rnk + penalty, max_rnk), min_rnk))
)
##     user  system elapsed 
##        0       0       0 

Though I am not anywhere close to your reported 120 seconds for 20k records, I'm guessing there is more computation going on than just this rnk2 calculation. (BTW: my test computer is a 2+ year old 2.8GHz i7 with 8GB of RAM running R-3.1.1 and, cough cough, win7.)

These methods are all producing identical results:

identical(step4b, step4c)
## [1] TRUE
identical(step4b, step4d)
## [1] TRUE

Since single-runs should not be trusted as absolute benchmarks, a more rigorous comparison might be insightful.

library(microbenchmark)
microbenchmark(
    iterative = {
        for(i in 1:nrow(step4b)){
            step4b$rnk2[i] <- new_rank2(step4b$rnk[i], step4b$penalty[i],
                                        step4b$min_rnk[i], step4b$max_rnk[i])
        }
    },
    vectorized = {
        step4c$rnk2 <- with(step4, pmax(pmin(rnk + penalty, max_rnk), min_rnk))
    },
    dplyr = {
        step4d <- step4 %>%
            mutate(rnk2 = pmax(pmin(rnk + penalty, max_rnk), min_rnk))
    }
)
## Unit: milliseconds
##        expr         min          lq      median          uq         max neval
##   iterative 3151.235603 3226.225834 3257.488366 3286.452867 3504.440315   100
##  vectorized    1.098110    1.159931    1.195153    1.247251    3.051811   100
##       dplyr    1.350165    1.418957    1.524622    1.604054    3.255437   100

In this loop alone, there is a difference of over three orders of magnitude in the test case, and that's using only 20k records. Choosing between base vectorized code and Hadley's dplyr is a personal preference, heavily influenced by the complexity of the code; in this case, I'd be hard-pressed not to use the vectorized version, but that's just me and this example.
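
As an aside (a sketch of my own, not one of the benchmarks above): much of the iterative version's cost is not the arithmetic but the repeated data-frame subsetting and the element-by-element growth of step4b$rnk2 on every pass. Pulling the columns out into plain vectors and preallocating the result makes even a loop run far faster, though still slower than pmin/pmax:

rnk     <- step4$rnk
penalty <- step4$penalty
min_rnk <- step4$min_rnk
max_rnk <- step4$max_rnk
rnk2 <- numeric(length(rnk))    # preallocate instead of growing
for (i in seq_along(rnk)) {
    rnk2[i] <- max(min(rnk[i] + penalty[i], max_rnk[i]), min_rnk[i])
}
step4e <- step4
step4e$rnk2 <- rnk2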


For your second batch of code, first note that in the second if statement you should replace the second == with a single = or (some might argue "even better") <-. Change this:

if (data$V5[i] == "MED") data$V3[i] == 'XL'

to

if (data$V5[i] == "MED") data$V3[i] <- 'XL'

Otherwise, the "then" portion of the conditional, data$V3[i] == 'XL', merely evaluates to a logical value (which is discarded) rather than assigning 'XL' to the element data$V3[i].
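
To see the difference on a toy value (a quick sketch of my own):

x <- 'S'
x == 'XL'    # a comparison: returns a logical, x is unchanged
## [1] FALSE
x <- 'XL'    # an assignment: x is now 'XL'
x
## [1] "XL"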

You can vectorize your for loop with something like this:

data$V3 <- NA
data$V3 <- ifelse(data$V1 %in% c('A', 'B', 'E'),
                  data$V4,
                  ifelse(data$V5 == 'MED',
                         'XL',
                         data$V3))

I first set $V3 to NA here primarily because I have no idea what is going on elsewhere; in reality, I'm assuming it is already set to a sane value and you are conditionally changing it. This is still somewhat readable with a nested ifelse, but I'd guard against nesting any deeper than this. If more conditions are required, you might get better readability (and perhaps performance) out of something like:

idx <- data$V1 %in% c('A', 'B', 'E')
data$V3[idx] <- data$V4[idx]
idx <- (data$V5 == 'MED')
data$V3[idx] <- 'XL'
## ...

... though you will need to be careful if any of the tests allow a datum to match more than once, since then the order of the comparisons affects the final values.
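
For example (a hypothetical two-row dataset of my own, just to show the caveat): the first row matches both tests, so the later 'MED' assignment overwrites the earlier one.

data <- data.frame(V1 = c('A', 'C'), V4 = c('M', 'M'),
                   V5 = c('MED', 'MED'), V3 = NA,
                   stringsAsFactors = FALSE)
idx <- data$V1 %in% c('A', 'B', 'E')
data$V3[idx] <- data$V4[idx]    # row 1 gets 'M' here ...
idx <- (data$V5 == 'MED')
data$V3[idx] <- 'XL'            # ... and is overwritten with 'XL' here
data$V3
## [1] "XL" "XL"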


Footnote:

  1. For clarity, I'm not saying that doing math in the thousands or the thousandths is equivalent, just the comparisons. There are efficiencies in multiplication (of numbers like 1e8) that make it ever-so-slightly faster than division (of the same order of magnitude), but the comparison itself is equivalent.
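
A rough way to check that claim yourself (my own sketch, reusing the microbenchmark package from above):

library(microbenchmark)
x <- runif(1e6) * 1e8
microbenchmark(
    multiply = x * 1e-8,    # multiply by the reciprocal
    divide   = x / 1e8      # divide by a number of the same order of magnitude
)
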
r2evans
  • This is awesome! Thanks for the detailed explanation. I will surely try your suggestions and report back. – speedchase Aug 06 '14 at 07:01
  • THIS IS INSANE! Thanks to r2evans, both of the suggested solutions were lightning fast. BTW, my computer is not that fast either; it is just an i5 with 4GB RAM. The first solution processed 400k records in 0.08 seconds and the second in 0.47 seconds. – speedchase Aug 06 '14 at 17:40