Benchmark
Out of curiosity, I ran a benchmark of some of the approaches and want to share the results. I couldn't quite believe some of the performance claims in the other answers, so I'm trying to clarify them here. To avoid being misled by one particular row/column ratio, I consider three scenarios:
ncol == nrow
ncol << nrow
ncol >> nrow.
It might be beneficial to coerce the data with as.matrix beforehand, so I included this as an additional solution (unlist_mat).
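As a rough sketch of what the unlist_mat idea amounts to, here is a toy version: coerce once to a character matrix, then do a single match() over all cells. The names toy, toy_look, toy_mat and toy_out are made up for illustration and are not part of the benchmark.

```r
# Toy data; `toy` and `toy_look` are hypothetical names for illustration.
toy <- data.frame(a = c("cat", "parrot"), b = c("lizard", "cat"),
                  stringsAsFactors = FALSE)
toy_look <- data.frame(pet = c("cat", "lizard", "parrot"),
                       class = c("mammal", "reptile", "bird"),
                       stringsAsFactors = FALSE)

# Coerce once to a character matrix (outside any timed expression).
toy_mat <- as.matrix(toy)

# One vectorised match() over all cells; assigning into toy_mat[]
# keeps the dimensions and column names of the matrix.
toy_mat[] <- toy_look$class[match(toy_mat, toy_look$pet)]

# Convert back to a data frame if one is needed downstream.
toy_out <- as.data.frame(toy_mat, stringsAsFactors = FALSE)
```

Here toy_out should hold a = mammal, bird and b = reptile, mammal. The benchmark call itself follows.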
microbenchmark::microbenchmark(
  lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
  unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
  unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)],  ## added
  ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
                                 ifelse(Dat3[col_set] == "lizard", "reptile",
                                        ifelse(Dat3[col_set] == "parrot", "bird", NA))),
  look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
  times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval cld
#     lapply  40.42905  63.47053  78.03831  86.51201  96.84294 107.17387     3 a
#     unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297     3 b
# unlist_mat  45.91743  56.51087  68.50595  67.10432  79.80021  92.49611     3 a
#     ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581     3 ab
#   look_vec  58.54449  88.40293 112.91165 118.26137 140.09522 161.92908     3 a
## 1e4 x 1e4
# Unit: seconds
#       expr       min        lq      mean    median         uq        max neval cld
#     lapply  2.427077  3.558234  3.992481  4.689390   4.775183   4.860977     3 a
#     unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084     3 b
# unlist_mat  4.940254  5.011684  5.576553  5.083114   5.894703   6.706291     3 a
#     ifelse  9.714553 14.444899 36.176777 19.175244  49.407889  79.640535     3 a
#   look_vec  8.460969  8.558600  8.784463  8.656230   8.946209   9.236188     3 a
## 1e5 x 1e3
# Unit: seconds
#       expr       min        lq      mean    median        uq        max neval cld
#     lapply  2.314427  2.403001  3.270708  2.491575  3.748848   5.006120     3 a
#     unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586     3 b
# unlist_mat  5.018869  5.060865  5.638499  5.102861  5.948314   6.793767     3 a
#     ifelse  6.244744 16.488266 39.208119 26.731788 55.689807  84.647825     3 ab
#   look_vec  4.512672  6.434651  7.496267  8.356630  8.988064   9.619498     3 a
## 1e3 x 1e5
# Unit: seconds
#       expr        min         lq       mean     median         uq        max neval cld
#     lapply  52.833019  55.373432  71.308981  57.913845  80.546963 103.180080     3 ab
#     unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819     3 c
# unlist_mat   3.872551   4.422904   4.695393   4.973257   5.106814   5.240372     3 a
#     ifelse  72.592437  76.473418 103.930063  80.354399 119.598876 158.843354     3 b
#   look_vec  56.444824  58.904604  62.677267  61.364383  65.793488  70.222593     3 ab
Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz using R --vanilla.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
Conclusion
Using lapply with a lookup table appears to be a good choice if the number of columns is at most about the number of rows. With many columns, especially relative to the number of rows, it can pay to coerce the relevant columns of the data frame to a matrix first (median about 5 s versus 58 s for lapply in the 1e3 x 1e5 case). The coercion itself is cheap: about 1.2 s elapsed for the 1e4 x 1e4 data.
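One way to act on this is a small wrapper that picks the strategy from the shape of the data. This is only a sketch: recode_cols and its wide_ratio threshold are hypothetical and not part of the benchmark, and the default cut-off of 1 is a guess based on the three shapes timed above rather than a tuned value.

```r
# Hypothetical helper: chooses between the two fastest strategies by shape.
# `wide_ratio` is an assumed cut-off, not something the benchmark established.
recode_cols <- function(dat, keys, values, wide_ratio = 1) {
  if (ncol(dat) > wide_ratio * nrow(dat)) {
    # Many columns: a single match() over a matrix, then back to a data frame.
    mat <- as.matrix(dat)
    mat[] <- values[match(mat, keys)]
    as.data.frame(mat, stringsAsFactors = FALSE)
  } else {
    # Comparatively few columns: column-wise match() via lapply().
    dat[] <- lapply(dat, function(x) values[match(x, keys)])
    dat
  }
}

# Usage on two tiny made-up data frames, one tall and one wide.
keys <- c("cat", "lizard", "parrot")
values <- c("mammal", "reptile", "bird")
tall <- data.frame(a = c("cat", "parrot", "lizard"), stringsAsFactors = FALSE)
wide <- data.frame(a = "cat", b = "lizard", c = "parrot", stringsAsFactors = FALSE)
res_tall <- recode_cols(tall, keys, values)
res_wide <- recode_cols(wide, keys, values)
```

Both branches should return a data frame with the same columns as the input; values without an entry in keys come back as NA.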
Data
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
                  class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
  Mat <- as.matrix(Dat)
)
#    user  system elapsed
#   0.844   0.318   1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat
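The code above builds the square 1e4 x 1e4 case; the other shapes follow by changing n and m. A minimal sketch of how all three scenarios could be generated, scaled down so it runs quickly (make_dat, shapes, scale_down and dats are hypothetical names, and stringsAsFactors = FALSE is my addition to keep the columns character across R versions):

```r
# Hypothetical helper wrapping the data-generating line above.
make_dat <- function(n, m, seed = 42) {
  set.seed(seed)
  pets <- c("cat", "lizard", "parrot")
  data.frame(matrix(sample(pets, n * m, replace = TRUE), n, m),
             stringsAsFactors = FALSE)
}

# Shapes (rows, columns) taken from the result headings of the larger runs.
shapes <- list(square = c(1e4, 1e4), tall = c(1e5, 1e3), wide = c(1e3, 1e5))

# Divide both dimensions by an assumed factor so the sketch stays small.
scale_down <- 100
dats <- lapply(shapes, function(s) make_dat(s[1] / scale_down, s[2] / scale_down))

# Rows and columns of each generated data frame.
sapply(dats, dim)
```

Setting scale_down to 1 would reproduce the benchmark sizes, at the cost of considerable memory and run time.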