Problem Identification
There is no verification that the outputs are correct and the same/equivalent between implementations: they are not. The first is a single (unsorted) column, and the second is just unsorted.
iris_big[..]
(i.e., the base::[
primitive) without a comma is selecting columns, not rows. Add a trailing comma.
base::order("Petal.Width.y")
, regardless if being within iris_big[..]
, is always returning the singular static 1
because it is sorting a character vector of length 1 (i.e., c("Petal.Width.y")
, no care if it might reference column names in the enclosing frame). As such, it is returning the first column with no change to the row ordering. The fact that the dimensions of the return value are wrong should be a strong hint that this is broken. (Credit to @DonaldSeinen for the start of this commentary.)
This is effectively one of
iris_big[1] # just the first column
iris_big[1,] # just the first row
This is fixed with
iris_big[base::order(iris_big$Petal.Width.y),]
Similarly, dplyr::arrange(iris_big, "Petal.Width.y")
is broken in the same way. If we continue with a quick check to make sure that there is no decrease in that column, we'll see
dplyr::arrange(iris_big, "Petal.Width.y") %>%
summarize(nondecr = all(diff(Petal.Width.y) >= 0))
# nondecr
# 1 FALSE
This is fixed by unquoting it:
dplyr::arrange(iris_big, Petal.Width.y) %>%
summarize(nondecr = all(diff(Petal.Width.y) >= 0))
# nondecr
# 1 TRUE
The "quoted" problem with the base and dplyr variants is confounded by the fact that base R is not using non-standard evaluation (NSE), dplyr
requires NSE in arrange
, and data.table::setorder
appears to use either quoted or unquoted (despite its instructions "Do not quote column names" in ?setorder
).
(The first bullet of a missing comma ,
is also confounded by data.table
, since iris_big_dt[1]
returns the first row, not the first column. While I think I understand some of the initial impetus for this design choice, I've always been of the opinion that this is a broken shortcut: it may save thousands (?) of otherwise-unnecessary commas every year, but at the expense of ambiguity when reading base/data.table code.)
Verification of Correctness/Sameness
An important check in benchmarking is that the results are (1) all correct, and (2) the same. Checking each individually, we see:
ret1wrong1 <- iris_big[base::order("Petal.Width.y")]
ret1wrong2 <- iris_big[base::order("Petal.Width.y"),] # add comma
ret1 <- iris_big[base::order(iris_big$Petal.Width.y),] # unquote, add comma
ret2wrong <- dplyr::arrange(iris_big, "Petal.Width.y")
ret2 <- dplyr::arrange(iris_big, Petal.Width.y) # unquote
ret3 <- data.table::setorder(iris_big_dt, "Petal.Width.y")
range(iris_big$Petal.Width.y) # informative
# [1] 0.1 2.5
head(ret1wrong1) # wrong, single column
# Sepal.Length.x
# 1 5.1
# 2 4.9
# 3 4.7
# 4 4.6
# 5 5.0
# 6 5.4
ret1wrong2 # wrong, single row
# Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
head(ret1) # CORRECT
# Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1351 5.1 3.5 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 1352 4.9 3.0 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 1353 4.7 3.2 1.3 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 1354 4.6 3.1 1.5 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 1355 5.0 3.6 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 1356 5.4 3.9 1.7 0.4 setosa 4.9 3.1 1.5 0.1 setosa
all(diff(ret1$Petal.Width.y) >= 0)
# [1] TRUE
head(ret2wrong) # first petal.Width.y is 0.2 not 0.1
# Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa 5.1 3.5 1.4 0.2 setosa
all(diff(ret2wrong$Petal.Width.y) >= 0)
# [1] FALSE
head(ret2) # CORRECT
# Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1 5.1 3.5 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 2 4.9 3.0 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 3 4.7 3.2 1.3 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 4 4.6 3.1 1.5 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 5 5.0 3.6 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 6 5.4 3.9 1.7 0.4 setosa 4.9 3.1 1.5 0.1 setosa
all(diff(ret2$Petal.Width.y) >= 0)
# [1] TRUE
head(ret3)
# Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# <num> <num> <num> <num> <fctr> <num> <num> <num> <num> <fctr>
# 1: 5.1 3.5 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 2: 4.9 3.0 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 3: 4.7 3.2 1.3 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 4: 4.6 3.1 1.5 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 5: 5.0 3.6 1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa
# 6: 5.4 3.9 1.7 0.4 setosa 4.9 3.1 1.5 0.1 setosa
all(diff(ret3$Petal.Width.y) >= 0)
# [1] TRUE
all.equal(ret1, ret2, check.attributes = FALSE)
# [1] TRUE
all.equal(ret1, ret3, check.attributes = FALSE)
# [1] TRUE
(We need check.attributes=FALSE
because it will otherwise complain about row-names and class differences, which are not critical in the comparison of the data.)
Corrected Benchmark
Now that we're assured of equality, let's benchmark them:
iris_big_dt1 <- as.data.table(iris_big) #for data.table
iris_big_dt2 <- as.data.table(iris_big) #for data.table
bench::mark(
"Base R" = {
iris_big[base::order(iris_big$Petal.Width.y),]
},
"dplyr" = {
dplyr::arrange(iris_big, Petal.Width.y)
},
"data.table 1" = {
data.table::setorder(iris_big_dt1, "Petal.Width.y")
},
"data.table 2" = {
data.table::setorder(copy(iris_big_dt2), "Petal.Width.y")
},
min_iterations = 1000,
check = FALSE)
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 Base R 3.33ms 3.61ms 262. 1.97MB 5.08 981 19 3.74s <NULL> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 2 dplyr 3.75ms 4.32ms 216. 1.63MB 3.74 983 17 4.55s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 3 data.table 1 1.19ms 1.37ms 713. 87.94KB 0.714 999 1 1.4s <NULL> <Rprofmem[,3] [1 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 4 data.table 2 2.66ms 3.26ms 304. 1.84MB 5.56 982 18 3.23s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
all(diff(iris_big_dt1$Petal.Width.y)>=0)
# [1] TRUE
all(diff(iris_big_dt2$Petal.Width.y)>=0)
# [1] FALSE
I included two versions of the data.table
variant, since it might be questioned that sorting an already-sorted (due to its referential/in-place operation) table would be faster to sort a second time. Even imposing the overhead of copy
ing the data every time, the data.table 2
variant is still clearly faster than both Base R
and dplyr
.