
I wanted to reproduce the fastest method of extracting sorted unique values demonstrated in this post: What is the fastest way to get a vector of sorted unique values from a data.table?

test_df <-
  data.frame(
    company = c(1, 1,  2, 2, 3)
  )

unique_values = df[,logical(1), keyby = company]$company

But I keep getting this error:

Error in `[.data.frame`(df, , logical(1), keyby = company) : unused argument (keyby = company)

Edit. Note that the focus of my question is to get this specific method to work. For proposals of other methods which achieve the goal, please follow the post to which I refer.

Przemyslaw Remin
  • Make `df <- data.table::as.data.table(test_df)` – GKi Apr 13 '21 at 14:47
  • In case you don't need them sorted: `unique(test_df$company)` or still not so slow with sort in *base*: `sort(unique(test_df$company))` – GKi Apr 13 '21 at 14:51
  • @GKi `unique(test_df$company)` is noticeably slow on large df. That is why I would like to get this example to work. – Przemyslaw Remin Apr 13 '21 at 15:03
  • That might be the case with multiple cores/threads. If you use only one core, or sum the time per thread, there should not be much difference. – GKi Apr 13 '21 at 15:18
  • Your example does not work because you create a `data.frame` and then want to use methods of `data.table`. So add the line in my first comment to convert it, or create a `data.table` directly. – GKi Apr 20 '21 at 07:09
  • My facepalm! You got it! Absolutely right. – Przemyslaw Remin Apr 20 '21 at 07:19
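Putting GKi's fix together: a minimal sketch (assuming `data.table` is installed) showing that converting the `data.frame` to a `data.table` makes the `keyby` idiom from the question work:

```r
library(data.table)

test_df <- data.frame(company = c(1, 1, 2, 2, 3))

# The keyby idiom needs a data.table, not a data.frame;
# [.data.frame has no keyby argument, hence the error.
df <- as.data.table(test_df)

# keyby groups by company and sorts the key; logical(1) is a
# throwaway j value, so each group contributes one sorted key row.
unique_values <- df[, logical(1), keyby = company]$company
unique_values
# [1] 1 2 3
```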

1 Answer


In case you are looking for a fast `unique`, have a look at `kit::funique` or `collapse::funique`:

setDTthreads(1)
microbenchmark::microbenchmark(
  dt = y[, logical(1), keyby = company]$company,
  base = unique(x$company),
  collapse = collapse::funique(x$company),
  kit = kit::funique(x$company))
#Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#       dt 12.862388 13.575131 14.759180 14.248541 14.945780 49.930937   100
#     base 12.939646 13.505176 14.734066 14.773846 15.415468 18.256204   100
# collapse  3.302862  3.589133  3.685685  3.692886  3.773045  4.063564   100
#      kit  1.903043  2.433478  2.963308  2.882986  3.076537  6.183840   100

setDTthreads(4)
microbenchmark::microbenchmark(
  dt = y[, logical(1), keyby = company]$company,
  base = unique(x$company),
  collapse = collapse::funique(x$company),
  kit = kit::funique(x$company))
#Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#       dt  5.480513  7.384032  7.873730  7.569420  8.346282 11.193741   100
#     base 12.998406 13.295775 14.464446 13.736353 14.856721 47.320488   100
# collapse  3.333292  3.549712  3.655851  3.645528  3.737236  4.325676   100
#      kit  1.881232  2.825040  2.959422  2.917149  3.004288  5.281440   100

Data and Libraries:

set.seed(42)
n <- 1e6
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)

x <- data.frame(company = sample(company, n, TRUE),
                item = sample(item, n, TRUE),
                sales = sample(sales, n, TRUE))

library(data.table)
y <- as.data.table(x)
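One caveat worth noting: as far as I know, `kit::funique` and `collapse::funique` return values in order of first appearance by default, whereas the `keyby` idiom returns them sorted. Wrapping the result in `sort()` (or, for collapse, passing its `sort = TRUE` argument) makes the results comparable. A minimal sketch, assuming `data.table`, `kit`, and `collapse` are installed:

```r
library(data.table)

x <- data.frame(company = c("S", "A", "W", "A", "S"))
y <- as.data.table(x)

# keyby-style sorted unique values
sorted_dt <- y[, logical(1), keyby = company]$company

# kit::funique keeps first-appearance order, so sort explicitly
sorted_kit <- sort(kit::funique(x$company))

# collapse::funique can sort via its sort argument
sorted_collapse <- collapse::funique(x$company, sort = TRUE)
```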
GKi
  • Thanks, no. I want to get unique values in the way I posted in my question. The recommendation for the kit package may belong to the question I referred to. I saw you already posted this alternative there. – Przemyslaw Remin Apr 15 '21 at 20:10