The code below sets values in column df$type based on the values in column df$weight.
n <- 1e3; m <- n*10
Treshold <- 50
wts <- runif(m)
df <- data.frame(id=seq_len(m), weight=wts * 100, type='L')
library(microbenchmark)
microbenchmark(
"df-col-row" = (df$type[df$weight < Treshold] <- "M"),
"df-row-col" = (df[df$weight < Treshold, ]$type <- "M")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# df-col-row 80.6 87.65 145.429 89.55 104.55 5109.1 100
# df-row-col 564.9 586.10 618.496 592.40 618.90 1601.0 100
Why is the first alternative faster than the second?
Update 1
As expected, the difference increases when more columns are added.
d9 <- data.frame(type='L', weight=wts * 100, c3=3, c4=4, c5=5, c6=6, c7=7, c8=8, c9=9)
microbenchmark(
"df-row-9col" = (d9[d9$weight < Treshold, ]$type <- "M")
)
# Unit: microseconds
# expr min lq mean median uq max neval
# df-row-9col 950.1 1091.55 1267.982 1111.1 1172.45 5806 100
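To see how the cost of the row-then-column form scales with width, a rough sketch like the one below can help. It uses base R's system.time instead of microbenchmark, and the helper time_row_col, its arguments, and the filler column names c1..ck are hypothetical names introduced only for this illustration. Absolute timings will differ by machine and R version.

```r
# Hypothetical helper: time the row-then-column form on a data frame
# that has `k` filler columns in addition to `type` and `weight`.
time_row_col <- function(k, m = 1e4, reps = 20, threshold = 50) {
  d <- data.frame(type = "L", weight = runif(m) * 100,
                  stringsAsFactors = FALSE)
  # Add k constant filler columns (illustrative names c1, c2, ...).
  for (j in seq_len(k)) d[[paste0("c", j)]] <- j
  # Repeat the assignment `reps` times and report the average wall time.
  elapsed <- system.time(
    for (r in seq_len(reps)) d[d$weight < threshold, ]$type <- "M"
  )[["elapsed"]]
  elapsed / reps
}

set.seed(42)
ks <- c(0, 8, 32)                      # number of filler columns to try
timings <- vapply(ks, time_row_col, numeric(1))
print(data.frame(extra_cols = ks, sec_per_call = timings))
```

If the row-then-column form really touches every column, sec_per_call should grow with extra_cols; system.time is coarse, so increase reps if the numbers come out as zero.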
Update 2
In the first alternative, df is copied once; in the second alternative, it is copied twice.
tracemem(df)
df$type[df$weight < Treshold] <- "M" # Alt 1.
#tracemem[0x000002c92d2b87c8 -> 0x000002c92d2b9498]: $<-.data.frame $<-
df[df$weight < Treshold, ]$type <- "M" # Alt 2.
#tracemem[0x000002c92d2b9498 -> 0x000002c92d2b9ad8]:
#tracemem[0x000002c92d2b9ad8 -> 0x000002c92d2c47d8]: [<-.data.frame [<-
untracemem(df)
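One way to reason about the tracemem output is to spell out, step by step, roughly what each nested replacement call does. This is a simplified sketch based on how R documents nested replacement functions, not a trace of the actual internals; the small data frame and the intermediate names col and sub are assumptions made for illustration.

```r
small <- data.frame(id = 1:6, weight = c(10, 60, 30, 80, 50, 20),
                    type = "L", stringsAsFactors = FALSE)
idx <- small$weight < 50

# Alt 1, df$type[idx] <- "M", spelled out:
alt1 <- small
col <- alt1$type          # extract a single column (a plain vector)
col[idx] <- "M"           # vector replacement, no data frame method involved
alt1$type <- col          # `$<-.data.frame` puts the one column back

# Alt 2, df[idx, ]$type <- "M", spelled out:
alt2 <- small
sub <- alt2[idx, ]        # `[.data.frame` subsets the rows of every column
sub$type <- "M"           # `$<-.data.frame` on the temporary subset
alt2[idx, ] <- sub        # `[<-.data.frame` writes every column back

# Both routes end with the same values in type.
print(identical(alt1$type, alt2$type))
```

In this reading, the second form builds a temporary data frame from all columns and then assigns all columns back through `[<-.data.frame`, which would be consistent with the extra copy and with the cost growing as columns are added.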