
The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:

# 1
sort(salesdt[, unique(company)])
# 2
sort(unique(salesdt$company))
# 3
salesdt[order(company), unique(company)]

Another answer suggested sort options other than lexicographical order:

salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]

The data.table was created by

library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company,item,sales) 

As always when several options are available to choose from, I started to wonder which solution would be best, in particular if the data.table were much larger. I have searched a bit on SO but haven't found an answer to this so far.

Uwe
  • You can time the code by using: `ptm <- proc.time()` `your code` `proc.time() - ptm` and hence find which is fastest (mind, those are 3 lines of code; the comment shows them as one ... refer to http://www.ats.ucla.edu/stat/r/faq/timing_code.htm for more info). – Krug May 01 '16 at 00:10
  • @Gracos Thank you for your suggestion. Here, the `microbenchmark` package is the better choice as it allows for benchmarking several expressions in one step. – Uwe May 01 '16 at 01:06
  • You are right. `proc.time()` is just a simpler alternative. – Krug May 01 '16 at 01:13

3 Answers


For benchmarking, a larger data.table with 1,000,000 rows is created:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

For the sake of completeness, the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

are benchmarked as well, although it seems obvious that sorting the unique values should be faster than the other way around.
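The intuition behind that expectation (my own illustration, not part of the benchmark): #4 and #5 pass all 1,000,000 values to sort(), while #1 and #2 only sort the handful of distinct values left after unique():

# rough size of what each approach has to sort
length(salesdt$company)          # 1e6 values go into sort() for #4 and #5
length(unique(salesdt$company))  # only 5 distinct values for #1 and #2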

In addition, the two other sort options from the answer mentioned in the question are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]

Edit: Following Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company
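This works because keyby groups by company and sorts the result by the grouping column, so the company column of the returned data.table already holds the sorted unique values. As Frank notes in the comments, any other cheap length-one vector can be used instead of logical(1); a minimal variation of my own:

# any length-one j expression will do; keyby takes care of the sorting
salesdt[, .(dummy = NA), keyby = company]$company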

Benchmarking, no key set

Benchmarking is done with the help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[,sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)

Please note the reverse order in the chart (#1 at the bottom, #8 at the top).

[Chart: autoplot of the benchmark timings, no key set]
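If the raw numbers are preferred over the chart, the benchmark object can also be summarised directly (an aside of mine, not part of the original benchmark):

# numeric summary (min, median, max, ...) of the same timings
summary(timings)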

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.

Variant #3 was a bit of a surprise to me. Despite data.table's fast radix sort, it is less efficient than #1 and #2. It seems to sort first and then extract the unique values.
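One way to check this suspicion (my own probe, and the exact messages depend on the data.table version) is the verbose argument of data.table's [ operator:

# verbose = TRUE prints what data.table does internally for this query,
# e.g. whether the full column is ordered before unique() is applied
salesdt[order(company), unique(company), verbose = TRUE]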

Benchmarking, data.table keyed by company

Motivated by this observation, I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")
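To verify that the key is in place (a quick check of mine):

# should return "company" after setkeyv()
key(salesdt)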

The timings show (please note the change in scale of the time axis) that keying accelerates #4 and #5 dramatically. They are even faster than #3. Note that timings for variant #8 are included in the next section.

[Chart: autoplot of the benchmark timings, data.table keyed by company]

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company), which isn't necessary if the data.table is already keyed by company. So I removed the unnecessary calls to order and sort from #3 and #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.

[Chart: autoplot of the benchmark timings, keyed and tuned]

Caveat: The benchmarking is based on the original data, which includes only 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results have been obtained with data.table v1.9.7.
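A sketch of how such a variation could be set up (my own illustration, not benchmarked here): generate many distinct company names and rebuild the test data before re-running the benchmark.

# hypothetical re-run with 10,000 distinct company names
company_many <- sprintf("Company%05d", seq_len(10000))
salesdt <- data.table(company = sample(company_many, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))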

Uwe
  • Huh, why are you sometimes taking `.N` or summing `sales`, which are totally unnecessary for finding the distinct values of `company`? The only reasonable comparison would be `timings <- microbenchmark::microbenchmark( salesdt[,logical(1), keyby=company]$company, sort(unique(salesdt$company)) ); ggplot2::autoplot(timings)` – Frank Apr 30 '16 at 16:28
  • Because it is not only about getting distinct values. There are other sorting options than ordering the company names alphabetically. May be, the reasoning behind this becomes clearer from http://stackoverflow.com/a/36951561/3817004. – Uwe Apr 30 '16 at 16:36
  • Ok, but (1) none of that comes up in your question here, which only pertains to the two things in my benchmark; (2) comparisons between lexicographic sorting and sorting based on aggregates like counts and sums make no sense, since obviously it takes longer when you have to compute those aggregates; and (3) is there really a question of how best to sort by counts/.N or sums? You only show one way of doing this and I think it probably is the best.... I suppose you could also `tapply`, `sort` and extract names, though that only makes sense because `company` happens to be a string. – Frank Apr 30 '16 at 16:41
  • Anyway, regarding the question posed here, you are missing `salesdt[,logical(1), keyby=company]$company`, which is fastest on my computer for your example. You can use any other length-one vector instead of `logical(1)`. – Frank Apr 30 '16 at 16:46
  • @Frank Thank you for your suggestion. I have repeated the benchmarks including your code. And - you're right! – Uwe May 01 '16 at 14:06
  • @Frank May I suggest that you post your solution as answer to the original [question](http://stackoverflow.com/questions/36936613/unique-sorted-rows-single-column-from-r-data-table)? – Uwe May 01 '16 at 14:54
  • You can strip out all of the stuff from `[.data.table` to increase the speed: `v = salesdt$company; tmp = data.table:::forderv(v, sort = FALSE, retGrp = TRUE); sort(v[tmp[attr(tmp, 'starts')]])`. And you can push it further by stripping out things from `forderv` as well. And then you'll hopefully realize that the speed increase you're seeing is mostly useless unless you're running this computation in a very big loop. Most of these methods will give very similar times for large data and a "times" argument of 2-3. Millisecond benchmarks are mostly useless. – eddi May 02 '16 at 19:00
  • @Frank I can't put your idea into practice. How should I rephrase your example `unique(df@myvariable)`? I tried `df[,logical(1), keyby = myvariable]$myvariable` and I get the error: Error in `[.data.frame`(df, , logical(1), keyby = myvariable) : unused argument (keyby = myvariable) – Przemyslaw Remin Apr 13 '21 at 14:23
  • @PrzemyslawRemin Not sure I follow. Your first piece of code should have $ instead of @, I think – Frank Apr 14 '21 at 15:32
  • @Frank It was only a typo in my comment. I present problem in a separate question: https://stackoverflow.com/q/67076995/1903793 – Przemyslaw Remin Apr 14 '21 at 16:57

Alternatively you could do the following:

library(data.table)
n <- 1e6
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

ptm <- proc.time() 
sort(salesdt[, unique(company)])
proc.time() - ptm

ptm <- proc.time() 
sort(unique(salesdt$company))
proc.time() - ptm

ptm <- proc.time() 
salesdt[order(company), unique(company)]
proc.time() - ptm

The information provided by proc.time is not as thorough as that of microbenchmark, but it is simpler.

Output for the above is:

sort(salesdt[, unique(company)])
user  system elapsed 
0.05    0.02    0.06 

sort(unique(salesdt$company))
user  system elapsed 
0.01    0.01    0.03 

salesdt[order(company), unique(company)]
user  system elapsed 
0.03    0.02    0.05 

The user time relates to the execution of the R code itself, the system time to CPU time spent by the operating system on behalf of the process, and the elapsed time is the wall-clock time since starting the stopwatch (roughly the sum of user and system time if the code runs without interruption). (Taken from http://www.ats.ucla.edu/stat/r/faq/timing_code.htm.)
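As an aside (my addition, not part of this answer), base R's system.time() wraps exactly this proc.time() pattern in a single call:

# equivalent one-liner to the ptm / proc.time() - ptm pattern above
system.time(sort(unique(salesdt$company)))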

Krug
  • Thank you very much for repeating the benchmarks with `proc.time`. Concerning the use of `microbenchmark` I'm following the advice of Hadley Wickham in his book [Advanced R](http://adv-r.had.co.nz/Performance.html#microbenchmarking). There is also a good explanation [here](http://stackoverflow.com/questions/28575697/what-is-the-difference-between-using-r-microbenchmark-and-system-time) on SO. – Uwe May 01 '16 at 14:47

Using kit::funique or collapse::funique might be a fast way to get a vector of sorted unique values.
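Before timing them, it may be worth checking that the alternatives return the same result as base R (my own sanity check, not part of the benchmark below; x is the data.table defined at the end of this answer):

# both comparisons should return TRUE
identical(sort(unique(x$company)), sort(kit::funique(x$company)))
identical(sort(unique(x$company)), collapse::funique(x$company, sort = TRUE))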

It should also be considered how many threads are used. Using one thread in data.table:

setDTthreads(1)
bench::mark(
  dt = x[, logical(1), keyby = company]$company,
  base = sort(unique(x$company)),
  collapse = collapse::funique(x$company, TRUE),
  kit = sort(kit::funique(x$company)))
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 dt          130.4ms  141.5ms      7.17   49.63MB     2.39     3     1      419ms
#2 base        170.1ms  170.1ms      5.88  166.15MB    17.6      1     3      170ms
#3 collapse     51.8ms     52ms     19.1     2.49KB     0       10     0      524ms
#4 kit          17.6ms   17.8ms     56.0         0B     0       28     0      500ms

Using four threads in data.table:

setDTthreads(4)
bench::mark(
  dt = x[, logical(1), keyby = company]$company,
  base = sort(unique(x$company)),
  collapse = collapse::funique(x$company, TRUE),
  kit = sort(kit::funique(x$company)))
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 dt           50.2ms   51.1ms     18.9    49.63MB     3.78    10     2      529ms
#2 base        147.1ms  161.3ms      6.26  166.15MB     6.26     4     4      639ms
#3 collapse     51.8ms   52.5ms     19.1     2.49KB     0       10     0      524ms
#4 kit          17.7ms   17.8ms     55.9         0B     0       28     0      501ms
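The number of threads data.table actually uses can be queried at any time (a side note of mine):

# report the current data.table thread setting
data.table::getDTthreads()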

When using system.time, the time used by each thread is summed up (compare the user time with the elapsed time):

setDTthreads(1)
system.time(x[,logical(1), keyby = company]$company)
#       user      system     elapsed 
#      0.122       0.004       0.126 

setDTthreads(4)
system.time(x[,logical(1), keyby = company]$company)
#       user      system     elapsed 
#      0.150       0.028       0.052 

system.time(collapse::funique(x$company))
#       user      system     elapsed 
#      0.053       0.000       0.053 

system.time(kit::funique(x$company))
#       user      system     elapsed 
#      0.018       0.000       0.018 

When using a key, the time needed to create the key should also be considered:

system.time(setkeyv(x, "company"))
#       user      system     elapsed 
#      0.241       0.012       0.253 

It looks like kit::funique is currently the fastest, followed by collapse::funique. Using one thread, base::unique is a little bit slower than using data.table.

Data and libraries:

set.seed(42)
n <- 1e7
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)

library(data.table)
x <- data.table(company = sample(company, n, TRUE), 
                item = sample(item, n, TRUE), 
                sales = sample(sales, n, TRUE))
GKi