As mentioned in my comment, for
loops in R are not specifically slow. However, often a for
loop indicates other inefficiencies in code. In this case, the subset operation that is repeated for each row to determine the mean
is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
if (is.na(df$rate[i])) {
mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1) ## This line!
if (is.nan(mrate)) {
df$rate[i] <- 1
} else {
df$rate[i] <- mrate
}
}
}
If instead, these group averages are determined before hand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
if (is.na(df$rate[i])) {
mrate <- foo$x[foo$Group.1 == df$id[i]]
...
However, I am still doing a subset at df$id[i]
on the large data.frame. Instead, using one of the tools that implements a split-apply-combine strategy is a good idea. Also, lets write a function that takes a single value and a pre-computed group average and does the right thing:
myfun <- function(DF) {
avg <- avgs$rate[avgs$id == unique(DF$id)]
if (is.nan(avg)) {
avg <- 1
}
DF$rate[is.na(DF$rate)] <- avg
return (DF)
}
The plyr
version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table
version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all lookup subsetting in leiu of adding a pre-calculated column and can now do row-wise lookups which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
The whole shebang can be written into a function or a data.table
expression. But, IMO, that often comes at the expense of clarity!