
I'm comparing how different data manipulation packages perform on certain operations at different size datasets.

I produce a dummy dataset (the Cartesian product of iris x iris; pretty meaningless, but essentially just a 22,500 x 10 dataset).

library(dplyr)
library(data.table)
library(rbenchmark)

iris_big <- merge(x = iris, y = iris, by = NULL) 

iris_big_dt <- as.data.table(iris_big) #for data.table
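A quick sanity check of the claimed shape (not in the original post, just confirming the 22,500 x 10 size):

```r
# Cross join of 150-row iris with itself: 150 * 150 rows, 5 + 5 columns.
iris_big <- merge(x = iris, y = iris, by = NULL)
dim(iris_big)
# [1] 22500    10
```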

benchmark("Base R" = {
            iris_big[base::order("Petal.Width.y")]
          },
          "dplyr" = {
            dplyr::arrange(iris_big,"Petal.Width.y")
          },          

          "data.table" = {
            data.table::setorder(iris_big_dt,"Petal.Width.y")
          },
          replications = 30,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

OUTPUT:

| test       | replications | elapsed | ... | sys.self |
| ---------- | ------------ | ------- | --- | -------- |
| Base R     | 30           | 0.00    | ... | 0.00     |
| data.table | 30           | 0.04    | ... | 0.02     |
| dplyr      | 30           | 1.55    | ... | 0.00     |

Why is base R so fast? Why is dplyr so slow? Am I doing something wrong? Thanks.

  • Your experience matches mine. Base R is fastest, data.table is fast, the tidyverse is slower. But, unless you're dealing with very large datasets or frequent, complex manipulation, I don't see the differences as being important. A few milliseconds (or even seconds) here and there don't really matter. I almost always use the tidyverse, and accept the performance hit. My reason is simple: readability. I find the tidyverse far more intuitive and consistent than the other two. That's important both when debugging code as I write it and when maintaining old code. I'll take the performance hit. – Limey Oct 29 '21 at 09:09
  • It is, for the most part, also my experience, although it is not true in every instance: data.table's `fread()` function is faster than readr's `read_csv()`, and both are markedly faster at reading in thousands of csv files when compared to base R's `read.table()`. So sometimes you can get better results if you specify some of the optional arguments, because the non-base-R functions usually perform a couple of extra steps – D.J Oct 29 '21 at 09:20
  • The output of the 3 functions shown here is not the same. The base R case results in a data.frame with 1 column, and the `data.table` and `dplyr` methods don't give the same ordering, i.e. we are comparing different things – Donald Seinen Oct 29 '21 at 09:28
  • @DonaldSeinen I've been struggling to work out how I can write the base function to give me the whole dataframe - I don't know why it is only returning one column (admittedly, I never write code in base R). dplyr and data.table produce the same result. The dplyr method is reading a data.frame, and the data.table method is reading a data.table. This is by necessity, as it mimics how you would use these functions in real scenarios – aStarIsCorn Oct 29 '21 at 11:02
  • ***Correct is always better than Faster.*** If you believe that one of the expressions to compare is not returning what you expect, then benchmarking it is a futile waste of time. Don't rush head-first into benchmarks: even when done correctly, they can still lead to uninformed decisions due to assumptions or mistakes. – r2evans Oct 29 '21 at 14:54
  • [Related with benchmark on larger data](https://stackoverflow.com/a/29331287/1851712) – Henrik Oct 30 '21 at 08:40
  • aStarIsCorn, you haven't spoken up since I posted the answer. Did you have comments/issues or does it answer your question? – r2evans Mar 30 '22 at 17:27

1 Answer


Problem Identification

There is no verification that the outputs are correct and the same/equivalent between implementations: they are not. The first is a single (unsorted) column, and the second is just unsorted.

  • iris_big[..] (i.e., the base::[ primitive) without a comma is selecting columns, not rows. Add a trailing comma.

  • base::order("Petal.Width.y"), regardless of being within iris_big[..], always returns the single static value 1, because it is sorting a character vector of length 1 (i.e., c("Petal.Width.y"); it does not care that the string might reference a column name in the enclosing frame). As such, it returns the first column with no change to the row ordering. The fact that the dimensions of the return value are wrong should be a strong hint that this is broken. (Credit to @DonaldSeinen for the start of this commentary.)

    This is effectively one of

    iris_big[1]     # just the first column
    iris_big[1,]    # just the first row
    

    This is fixed with

    iris_big[base::order(iris_big$Petal.Width.y),]
    
  • Similarly, dplyr::arrange(iris_big, "Petal.Width.y") is broken in the same way. If we continue with a quick check to make sure that there is no decrease in that column, we'll see

    dplyr::arrange(iris_big, "Petal.Width.y") %>%
      summarize(nondecr = all(diff(Petal.Width.y) >= 0))
    #   nondecr
    # 1   FALSE
    

    This is fixed by unquoting it:

    dplyr::arrange(iris_big, Petal.Width.y) %>%
      summarize(nondecr = all(diff(Petal.Width.y) >= 0))
    #   nondecr
    # 1    TRUE
    

The "quoted" problem with the base and dplyr variants is confounded by the fact that base R is not using non-standard evaluation (NSE), dplyr requires NSE in arrange, and data.table::setorder appears to use either quoted or unquoted (despite its instructions "Do not quote column names" in ?setorder).
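A minimal sketch of the three quoting behaviors, using a toy data frame rather than the benchmark data:

```r
library(dplyr)
library(data.table)

# order() on a character scalar sorts a length-1 vector: always 1.
base::order("Petal.Width.y")
# [1] 1

df <- data.frame(x = c(2, 1, 3))

# arrange() with a quoted name sorts by a constant string, so the rows
# come back unchanged; the .data pronoun is the supported way to pass a
# column name as a string.
arrange(df, "x")$x          # 2 1 3  (unchanged)
arrange(df, .data[["x"]])$x # 1 2 3  (sorted)

# setorder() happens to accept the quoted form as well, despite ?setorder.
dt <- as.data.table(df)
setorder(dt, "x")
dt$x                        # 1 2 3
```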

(The first bullet's missing comma is also confounded by data.table, since iris_big_dt[1] returns the first row, not the first column. While I think I understand some of the initial impetus for this design choice, I've always been of the opinion that this is a broken shortcut: it may save thousands (?) of otherwise-unnecessary commas every year, but at the expense of ambiguity when reading base/data.table code.)
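To make that ambiguity concrete, a tiny sketch of the diverging single-index behavior:

```r
library(data.table)

df <- data.frame(a = 1:3, b = 4:6)
dt <- data.table(a = 1:3, b = 4:6)

df[1]  # data.frame: first COLUMN (a 3 x 1 data.frame)
dt[1]  # data.table: first ROW    (a 1 x 2 data.table)
```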

Verification of Correctness/Sameness

An important check in benchmarking is that the results are (1) all correct, and (2) the same. Checking each individually, we see:

ret1wrong1 <- iris_big[base::order("Petal.Width.y")]
ret1wrong2 <- iris_big[base::order("Petal.Width.y"),]      # add comma
ret1 <- iris_big[base::order(iris_big$Petal.Width.y),]     # unquote, add comma
ret2wrong <- dplyr::arrange(iris_big, "Petal.Width.y")
ret2 <- dplyr::arrange(iris_big, Petal.Width.y)            # unquote
ret3 <- data.table::setorder(iris_big_dt, "Petal.Width.y")

range(iris_big$Petal.Width.y) # informative
# [1] 0.1 2.5

head(ret1wrong1)          # wrong, single column
#   Sepal.Length.x
# 1            5.1
# 2            4.9
# 3            4.7
# 4            4.6
# 5            5.0
# 6            5.4
ret1wrong2                # wrong, single row
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
head(ret1)                # CORRECT
#      Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1351            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1352            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1353            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1354            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1355            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 1356            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret1$Petal.Width.y) >= 0)
# [1] TRUE

head(ret2wrong)           # first petal.Width.y is 0.2 not 0.1
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 2            4.9           3.0            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 3            4.7           3.2            1.3           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 4            4.6           3.1            1.5           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 5            5.0           3.6            1.4           0.2    setosa            5.1           3.5            1.4           0.2    setosa
# 6            5.4           3.9            1.7           0.4    setosa            5.1           3.5            1.4           0.2    setosa
all(diff(ret2wrong$Petal.Width.y) >= 0)
# [1] FALSE
head(ret2)                # CORRECT
#   Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
# 1            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret2$Petal.Width.y) >= 0)
# [1] TRUE

head(ret3)
#    Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y
#             <num>         <num>          <num>         <num>    <fctr>          <num>         <num>          <num>         <num>    <fctr>
# 1:            5.1           3.5            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 2:            4.9           3.0            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 3:            4.7           3.2            1.3           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 4:            4.6           3.1            1.5           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 5:            5.0           3.6            1.4           0.2    setosa            4.9           3.1            1.5           0.1    setosa
# 6:            5.4           3.9            1.7           0.4    setosa            4.9           3.1            1.5           0.1    setosa
all(diff(ret3$Petal.Width.y) >= 0)
# [1] TRUE

all.equal(ret1, ret2, check.attributes = FALSE)
# [1] TRUE
all.equal(ret1, ret3, check.attributes = FALSE)
# [1] TRUE

(We need check.attributes=FALSE because it will otherwise complain about row-names and class differences, which are not critical in the comparison of the data.)
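For reference (the exact messages vary across R and data.table versions), a small sketch of what the flag suppresses:

```r
a <- data.frame(x = 1:2)
b <- data.table::as.data.table(a)

# The values agree, but the class (and row.names) attributes differ,
# so the default comparison is not simply TRUE:
isTRUE(all.equal(a, b))                            # FALSE
isTRUE(all.equal(a, b, check.attributes = FALSE))  # TRUE
```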

Corrected Benchmark

Now that we're assured of equality, let's benchmark them:

iris_big_dt1 <- as.data.table(iris_big) #for data.table
iris_big_dt2 <- as.data.table(iris_big) #for data.table

bench::mark(
  "Base R" = {
    iris_big[base::order(iris_big$Petal.Width.y),]
  },
  "dplyr" = {
    dplyr::arrange(iris_big, Petal.Width.y)
  },
  "data.table 1" = {
    data.table::setorder(iris_big_dt1, "Petal.Width.y")
  },
  "data.table 2" = {
    data.table::setorder(copy(iris_big_dt2), "Petal.Width.y")
  },
  min_iterations = 1000,
  check = FALSE)
# # A tibble: 4 x 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time             gc                  
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>           <list>              
# 1 Base R         3.33ms   3.61ms      262.    1.97MB    5.08    981    19      3.74s <NULL> <Rprofmem[,3] [13 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 2 dplyr          3.75ms   4.32ms      216.    1.63MB    3.74    983    17      4.55s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>
# 3 data.table 1   1.19ms   1.37ms      713.   87.94KB    0.714   999     1       1.4s <NULL> <Rprofmem[,3] [1 x 3]>  <bch:tm [1,000]> <tibble [1,000 x 3]>
# 4 data.table 2   2.66ms   3.26ms      304.    1.84MB    5.56    982    18      3.23s <NULL> <Rprofmem[,3] [15 x 3]> <bch:tm [1,000]> <tibble [1,000 x 3]>

all(diff(iris_big_dt1$Petal.Width.y)>=0)
# [1] TRUE
all(diff(iris_big_dt2$Petal.Width.y)>=0)
# [1] FALSE

I included two versions of the data.table variant, since one might question whether sorting an already-sorted table (already sorted due to setorder's referential/in-place operation) is faster to sort a second time. Even imposing the overhead of copying the data every time, the data.table 2 variant is still clearly faster than both Base R and dplyr.
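That referential behavior can be sketched on a toy table:

```r
library(data.table)

dt <- data.table(x = c(2, 1, 3))
setorder(dt, x)   # no assignment needed: dt itself is reordered in place
dt$x
# [1] 1 2 3

# copy() breaks the reference, leaving the original untouched:
dt2 <- data.table(x = c(2, 1, 3))
invisible(setorder(copy(dt2), x))
dt2$x
# [1] 2 1 3
```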

r2evans
  • Thank you very much @r2evans for the time spent writing this very enlightening post, which is as informative as it is necessary. – lovalery Oct 29 '21 at 15:16
  • If voting were based on detail and comprehensiveness, r2evans should have at least >400k rep points – akrun Oct 29 '21 at 17:42