In base R, why is selecting a column, then filtering rows, faster than the reverse (filtering rows, then selecting a column)?

Question

The code below changes values in column $type, based on values in column $weight.

n <- 1e3; m <- n*10
Treshold <- 50
wts      <- runif(m)
df  <- data.frame(id=seq_len(m), weight=wts * 100, type='L')

library(microbenchmark)
microbenchmark(
"df-col-row" = (df$type[df$weight < Treshold]   <- "M"),
"df-row-col" = (df[df$weight < Treshold, ]$type <- "M")
)
#
#Unit: microseconds
#       expr   min     lq    mean median     uq    max neval
# df-col-row  80.6  87.65 145.429  89.55 104.55 5109.1   100
# df-row-col 564.9 586.10 618.496 592.40 618.90 1601.0   100

Why is the first alternative faster than the second?

Update 1
As expected, the difference increases when more columns are added.

d9  <- data.frame(type='L', weight=wts * 100, c3=3, c4=4, c5=5, c6=6, c7=7, c8=8, c9=9)
microbenchmark(
"df-row-9col" = (d9[d9$weight < Treshold, ]$type <- "M")
)
#Unit: microseconds
#         expr   min      lq     mean median      uq  max neval
# df-row-9col 950.1 1091.55 1267.982 1111.1 1172.45 5806   100

Update 2
According to tracemem(), df is copied once in the first alternative and twice in the second.

tracemem(df)
df$type[df$weight < Treshold]   <- "M"    # Alt 1.
#tracemem[0x000002c92d2b87c8 -> 0x000002c92d2b9498]: $<-.data.frame $<- 

df[df$weight < Treshold, ]$type <- "M"    # Alt 2.
#tracemem[0x000002c92d2b9498 -> 0x000002c92d2b9ad8]: 
#tracemem[0x000002c92d2b9ad8 -> 0x000002c92d2c47d8]: [<-.data.frame [<-
untracemem(df)

This has to do with how names or objects are bound to memory, and with the strategies R uses to save memory. Rule of thumb: column-wise calculations are faster than row-wise calculations, because in a column-wise operation the machine can read the entire column without having to skip around in memory. Row-wise calculations jump around in memory and are therefore slower. If you really want to understand start reading this — TarJae, Mar 18 '23 at 11:01
@TarJae, I did. Indeed, in the first alternative, df is copied once, in the second alternative twice. — clp, Mar 19 '23 at 12:40

zephryl · Accepted Answer · 2023-03-18T22:31:50.560

Keep in mind that data.frames are built on lists. For instance, if you unclass(mtcars), you see it’s just a list of vectors:

$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

$disp
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

# ...

This makes it a bit more intuitive why it’s faster to select columns before filtering. It’s equivalent to selecting a subset of vectors from a list of vectors and then subsetting just those vectors, as opposed to subsetting every vector in the list and then keeping only some of them.

If you unclass() your example data and operate on the underlying list, you get similarly disparate performance, but it’s clearer why the second operation is more expensive:

set.seed(13)

library(microbenchmark)

undf <- unclass(df)

microbenchmark(
  "undf-col-row" = undf$type[undf$weight < Treshold],
  "undf-row-col" = lapply(undf, \(x) x[undf$weight < Treshold])$type
)

Unit: microseconds
         expr    min       lq     mean   median       uq       max neval cld
 undf-col-row  72.18  89.1200 103.0429 101.5650 112.1550   254.341   100  a 
 undf-row-col 201.87 269.9105 433.0011 282.4355 310.7255 14712.019   100   b
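
To make the cost difference concrete without relying on timings, here is a rough sketch that counts how many elements each approach copies. The helpers make_list(), copied_col_row() and copied_row_col() and the column names c1, c2, ... are made up for this illustration; they stand in for unclass(df) with a varying number of columns.

```r
# Hypothetical toy list standing in for unclass(df); all names are illustrative.
make_list <- function(n_cols, n_rows = 1000) {
  cols <- replicate(n_cols, seq_len(n_rows), simplify = FALSE)
  names(cols) <- paste0("c", seq_len(n_cols))
  cols
}

# Row filter that keeps every other row (500 of 1000).
keep <- rep(c(TRUE, FALSE), length.out = 1000)

# Column first: only the selected vector is subset.
copied_col_row <- function(lst) length(lst$c1[keep])

# Rows first: every vector in the list is subset before one is picked.
copied_row_col <- function(lst) sum(lengths(lapply(lst, function(x) x[keep])))

for (k in c(3, 9)) {
  lst <- make_list(k)
  cat(k, "columns:", copied_col_row(lst), "vs", copied_row_col(lst), "\n")
}
```

With 3 columns this copies 500 elements versus 1500, and with 9 columns 500 versus 4500, so the work of the rows-first version grows with the number of columns while the column-first version stays flat.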

Indeed as expected, the difference increases when more columns are added. — clp, Mar 18 '23 at 12:41

score 3 · Answer 2 · answered Mar 18 '23 at 11:25

In the second version, you first subset the whole data frame, then take the column type from the result and assign values just to that. R evaluates the expression as written and does not "know" that it only needs the one column. The more efficient way is to take just the column, subset it and assign the values. Other languages like SQL or Haskell have optimizers that would notice and make that change, but R, as an interpreted language, does not.

If you are unsure about the execution order of a piece of code, you can look at the abstract syntax tree. There you can see that the second expression does a more expensive subset of the whole data frame:

library(lobstr)

ast(df$type[df$weight < Treshold]   <- "M")
#> █─`<-` 
#> ├─█─`[` 
#> │ ├─█─`$` 
#> │ │ ├─df 
#> │ │ └─type 
#> │ └─█─`<` 
#> │   ├─█─`$` 
#> │   │ ├─df 
#> │   │ └─weight 
#> │   └─Treshold 
#> └─"M"
ast(df[df$weight < Treshold, ]$type <- "M")
#> █─`<-` 
#> ├─█─`$` 
#> │ ├─█─`[` 
#> │ │ ├─df 
#> │ │ ├─█─`<` 
#> │ │ │ ├─█─`$` 
#> │ │ │ │ ├─df 
#> │ │ │ │ └─weight 
#> │ │ │ └─Treshold 
#> │ │ └─`` 
#> │ └─type 
#> └─"M"
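
Read from the inside out, the two trees correspond roughly to the explicit steps below. This is a sketch on a small made-up data frame (df1, df2, keep, type_col and tmp are illustrative names, not from the question), assuming R's usual handling of nested replacement calls through a temporary copy; it is not a literal trace of what the interpreter runs.

```r
# Toy data, not the question's df; stringsAsFactors = FALSE keeps `type` character.
df1 <- data.frame(id = 1:4, weight = c(10, 60, 30, 80), type = "L",
                  stringsAsFactors = FALSE)
df2 <- df1
keep <- df1$weight < 50

# Column first: only the `type` vector is extracted, modified and put back.
type_col <- df1$type
type_col[keep] <- "M"
df1$type <- type_col

# Rows first: every column is subset, then every column is written back.
tmp <- df2[keep, ]     # `[.data.frame` builds a new data frame with all columns
tmp$type <- "M"        # `$<-.data.frame` changes one column of that temporary
df2[keep, ] <- tmp     # `[<-.data.frame` assigns all columns of the selected rows

# Both routes end with the same `type` column.
identical(df1$type, df2$type)
```

The second route also subsets and reassigns id and weight even though neither changes, which is the extra work the benchmark picks up.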
