4

I have a data frame:

id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)

and I need to replace each NA in rate with the mean rate for its id (rounded to one decimal place), falling back to 1 when every rate for that id is NA. This is what I do now:

for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}

Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this much faster without a for loop?

Thanks!

Rock
    fwiw, the `for` loop is not what is slow. Instead, my guess is that the slowest portion of your code is the subsetting operation performed when you determine the `mean`. If you pre-calculate each group mean using `aggregate` or another method, your loop will be significantly faster... But a `data.table` solution is almost surely cleaner and quicker still! – Justin May 22 '13 at 15:03

3 Answers

5

This is a solution using data.table:

library(data.table)
dt <- data.table( df, key = "id" )
# replace NAs with the rounded group mean (all-NA groups become NaN here)
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
# groups with no non-NA rate fall back to 1
dt[ is.na(rate), rate := 1 ]
dt
    id rate
 1:  1  1.0
 2:  1  1.0
 3:  1  1.0
 4:  1  1.0
 5:  2  0.5
 6:  2  0.6
 7:  2  0.6
 8:  3  0.7
 9:  3  0.7
10:  4  1.0

I am not sure, though, whether the ifelse could/should be avoided.
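
For what it is worth, one possible way around the ifelse (a sketch, not benchmarked) is to let replace touch only the NA entries within each group, keeping the same all-NA fallback afterwards:

dt <- data.table( df, key = "id" )
# fill each group's NAs with the rounded group mean; all-NA groups become NaN
dt[ , rate := replace( rate, is.na(rate), round( mean(rate, na.rm=TRUE), 1) ), by = id ]
# fall back to 1 for groups that had no non-NA rate
dt[ is.na(rate), rate := 1 ]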

Beasterfield
4

As mentioned in my comment, for loops in R are not inherently slow. However, a for loop often points to other inefficiencies in the code. In this case, the subsetting operation repeated for each row to determine the mean is most likely the slowest piece of the code.

for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)  ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}

If, instead, these group averages are determined beforehand, the loop can do a rapid lookup.

foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
...

However, I am still doing a subset on the large data.frame at df$id[i]. Instead, using one of the tools that implement a split-apply-combine strategy is a good idea. Also, let's write a function that takes one id's chunk of the data frame, looks up its pre-computed group average, and does the right thing:

myfun <- function(DF) {
  # look up the pre-computed average for this id
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1                       # all-NA group: fall back to 1
  }
  DF$rate[is.na(DF$rate)] <- avg   # fill the NAs for this chunk

  return(DF)
}

The plyr version:

 library(plyr)
 avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
 result <- ddply(df, .(id), myfun)

And the likely much faster data.table version:

 library(data.table)
 DT <- data.table(df)
 setkey(DT, id)

 DT[, avg := mean(rate, na.rm=TRUE), by=id]
 DT[is.nan(avg), avg := 1]

 DT[, rate := ifelse(is.na(rate), avg, rate)]

This way, we've avoided all lookup subsetting in favor of adding a pre-calculated column, and the row-wise replacement is fast and efficient. The extra column can be dropped inexpensively using:

DT[, avg := NULL]

The whole shebang can be written into a function or a data.table expression. But, IMO, that often comes at the expense of clarity!
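
For reference, the same steps might be chained into a single data.table expression (a sketch of the operations above, not necessarily clearer):

 library(data.table)
 DT <- data.table(df)
 setkey(DT, id)
 # chain: group means, all-NA fallback, NA replacement, drop the helper column
 DT[, avg := mean(rate, na.rm=TRUE), by=id][
    is.nan(avg), avg := 1][
    , rate := ifelse(is.na(rate), avg, rate)][
    , avg := NULL]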

Justin
3

I'm not sure this exactly answers the OP's question, but for others who read this later, there is a different and much faster way of performing calculations on a subset of data without actually subsetting it: vector math. The engineers in the crowd will know what I'm talking about.

Instead of subsetting, use a very fast operation to create an identity vector and multiply the data by it.

Now, this isn't faster for all cases. There are cases where vectorized functions are actually slower than item-explicit functions, and it all depends on your specific application. [Insert O-notation rant of your choice here.]

Here is how we would do a vector math implementation for this case:

# Create the NA identity vector.
na_identity <- is.na(df$rate)

# Initialize the final data frame.
# This is for non-destructive purposes.
df_revised <- df

# Replace all NA occurrences in final
# data frame with zero values.
df_revised$rate[na_identity] <- 0

# Loop through each unique [id]
# value in the data.
# Create an identity vector for the
# current ID, calculate the mean
# rate for that ID (replacing NaN with 1),
# and insert the mean for any NA values
# associated with that ID.
for (i in unique(df$id)){
    id_identity <- df$id==i
    id_mean <- sum(df_revised$rate * id_identity * !na_identity) / sum(id_identity * !na_identity)
    if(is.nan(id_mean)){id_mean <- 1}
    df_revised$rate <- df_revised$rate + id_mean * id_identity * na_identity
}

#    id rate
# 1   1 1.00
# 2   1 1.00
# 3   1 1.00
# 4   1 1.00
# 5   2 0.50
# 6   2 0.60
# 7   2 0.55
# 8   3 0.70
# 9   3 0.70
# 10  4 1.00

From a vector math perspective, this code is easily readable. In this small example, the code is very fast, but the run time increases directly with the number of unique ID values. I'm not sure whether this is the right approach for the OP's larger application, but the solution is workable and theoretically sound, and it removes the need for complex, difficult-to-read logic blocks.

Dinre