I generated a sample dataset to try some benchmarking for comparisons. (If it doesn't adequately represent your actual data, please let me know and I'll give it another run. I'm not certain it matters, since pmin and pmax tend to work equally well with numbers in the thousands and the thousandths[1].)
new_rank2 <- function(rnk, penalty, min_rnk, max_rnk) {
  rnk2 <- max(min(rnk + penalty, max_rnk), min_rnk)
  return(rnk2)
}
set.seed(1)
n <- 20000
step4 <- data.frame(rnk = runif(n), penalty = runif(n),
                    min_rnk = runif(n), max_rnk = runif(n))
step4b <- step4c <- step4d <- step4
Basic performance of three methods. First, your iterative method:
system.time(
  for (i in 1:nrow(step4b)) {
    step4b$rnk2[i] <- new_rank2(step4b$rnk[i], step4b$penalty[i],
                                step4b$min_rnk[i], step4b$max_rnk[i])
  }
)
## user system elapsed
## 3.40 0.00 3.41
Second, a vectorized method:
system.time(
  step4c$rnk2 <- with(step4, pmax(pmin(rnk + penalty, max_rnk), min_rnk))
)
## user system elapsed
## 0.02 0.00 0.02
Third, a method utilizing Hadley Wickham's dplyr:
library(dplyr)
system.time(
  step4d <- step4 %>%
    mutate(rnk2 = pmax(pmin(rnk + penalty, max_rnk), min_rnk))
)
## user system elapsed
## 0 0 0
Though I am not anywhere close to your reported 120 seconds for 20k records, I'm guessing that there is more computation going on than this rnk2 calculation alone. (BTW: my test computer is a 2+ year old i7 2.8GHz with 8GB of RAM running R-3.1.1 and cough cough win7.)
All three methods produce identical results:
identical(step4b, step4c)
## [1] TRUE
identical(step4b, step4d)
## [1] TRUE
Since single runs should not be trusted as absolute benchmarks, a more rigorous comparison might be insightful.
library(microbenchmark)
microbenchmark(
  iterative = {
    for (i in 1:nrow(step4b)) {
      step4b$rnk2[i] <- new_rank2(step4b$rnk[i], step4b$penalty[i],
                                  step4b$min_rnk[i], step4b$max_rnk[i])
    }
  },
  vectorized = {
    step4c$rnk2 <- with(step4, pmax(pmin(rnk + penalty, max_rnk), min_rnk))
  },
  dplyr = {
    step4d <- step4 %>%
      mutate(rnk2 = pmax(pmin(rnk + penalty, max_rnk), min_rnk))
  }
)
## Unit: milliseconds
##       expr         min          lq      median          uq         max neval
##  iterative 3151.235603 3226.225834 3257.488366 3286.452867 3504.440315   100
## vectorized    1.098110    1.159931    1.195153    1.247251    3.051811   100
##      dplyr    1.350165    1.418957    1.524622    1.604054    3.255437   100
In this loop alone there is a difference of over three orders of magnitude in the test case, and that's with 20k records. The choice between the vectorized code and Hadley's dplyr is a personal one, heavily influenced by the complexity of the surrounding code; in this case I'd be hard-pressed not to use the vectorized code, but that's just me and this example.
For your second batch of code, first note that in your second if statement you should replace the == with a single = or (some might argue "even better") <-. Change this:
if (data$V5[i] == "MED") data$V3[i] == 'XL'
to
if (data$V5[i] == "MED") data$V3[i] <- 'XL'
Otherwise, the "then" portion of the conditional, data$V3[i] == 'XL', merely evaluates to a logical (TRUE or FALSE) and does not assign 'XL' into the data$V3[i] element.
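If it helps to see that distinction in isolation, here is a tiny sketch (using a made-up vector v, not your data) showing that == only compares while <- actually assigns:
v <- c('S', 'MED', 'L')
v[2] == 'XL'   # a comparison: returns a logical, v is unchanged
## [1] FALSE
v[2] <- 'XL'   # an assignment: modifies v
v
## [1] "S"  "XL" "L"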
You can vectorize your for loop with something like this:
data$V3 <- NA
data$V3 <- ifelse(data$V1 %in% c('A', 'B', 'E'),
                  data$V4,
                  ifelse(data$V5 == 'MED',
                         'XL',
                         data$V3))
I first set $V3 to NA here primarily because I have no idea what is going on elsewhere; in reality, I'm assuming it is already set to a sane value and you are conditionally changing it. This is still somewhat readable with a nested ifelse, but I'd guard against nesting any deeper than this. If more conditioning is required, you might get better readability (and perhaps performance) out of something like:
idx <- data$V1 %in% c('A', 'B', 'E')
data$V3[idx] <- data$V4[idx]
idx <- (data$V5 == 'MED')
data$V3[idx] <- 'XL'
## ...
... though you will need to be careful if any of the tests allow a datum to match more than one condition, since the order of the comparisons then determines which update wins.
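As a contrived illustration of that caveat (toy data I made up, not yours): a row satisfying both tests ends up with whichever assignment runs last, so swapping the two blocks changes the result:
toy <- data.frame(V1 = c('A', 'C'), V4 = c('S', 'S'),
                  V5 = c('MED', 'MED'), V3 = NA,
                  stringsAsFactors = FALSE)
idx <- toy$V1 %in% c('A', 'B', 'E')
toy$V3[idx] <- toy$V4[idx]   # row 1 gets 'S' here ...
idx <- (toy$V5 == 'MED')
toy$V3[idx] <- 'XL'          # ... and is then overwritten with 'XL'
toy$V3
## [1] "XL" "XL"
Note that the nested ifelse above gives the V1 test priority, while this sequential version lets the later MED test win; pick whichever ordering reflects your intent.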
Footnote:
1. For clarity, I'm not saying that doing math in the thousands or the thousandths is equivalent, just the comparisons. There are efficiencies in multiplication (of numbers like 1e8) that make it ever-so-slightly faster than division (of the same OOM), but the comparison itself is equivalent.
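If you want to check that claim on your own hardware, a rough sketch along these lines (the vector size and bounds are arbitrary picks of mine) should show essentially indistinguishable timings at either scale:
library(microbenchmark)
big   <- runif(1e5) * 1e3    # values in the thousands
small <- runif(1e5) * 1e-3   # values in the thousandths
microbenchmark(
  thousands   = pmax(pmin(big, 750), 250),
  thousandths = pmax(pmin(small, 7.5e-4), 2.5e-4)
)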