
I used to do my data wrangling with dplyr, but some of the computations are "slow", in particular subsetting by groups. I read that dplyr is slow when there are a lot of groups, and based on this benchmark data.table could be faster, so I started to learn data.table.

Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and, for each group, keep the rows with the maximum datetime.
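
To make the goal concrete, here is a tiny, purely hypothetical example (toy data, not the real dataset); note that all rows tied at a group's maximum datetime are kept:

library(dplyr)

toy <- data.frame(id1      = c("a", "a", "b"),
                  id2      = c("x", "x", "y"),
                  datetime = as.POSIXct(c("2015-01-01", "2015-06-01", "2015-03-01")))
# within each (id1, id2) group, keep the row(s) whose datetime equals the group maximum
toy %>% group_by(id1, id2) %>% filter(datetime == max(datetime))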

Data

# random datetime generation function by Dirk Eddelbuettel
# https://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/05") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  dt <- as.numeric(difftime(et,st,unit="sec"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
}

set.seed(42)
# Creating 230000 pairs of ids
ids <- data.frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"), 
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly repeating ids[1:2000, ] to create groups with several rows
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
# Adding a random datetime variable and dummy variables to mimic the real data
datas <- transform(ids, 
                   datetime = rand.datetime(25e4), 
                   var1 = sample(LETTERS[1:6], 25e4, rep = TRUE), 
                   var2 = sample(c(1:10, NA), 25e4, rep = TRUE), 
                   var3 = sample(c(1:10, NA), 25e4, rep = TRUE), 
                   var4 = rand.datetime(25e4), 
                   var5 = rand.datetime(25e4))

library(dplyr)
library(data.table)
library(microbenchmark)

datas.tbl <- tbl_df(datas)
datas.dt <- data.table(datas, key = c("id1", "id2"))
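
A quick, optional sanity check on the generated data (the counts are approximate, since the random id strings may occasionally collide):

nrow(datas)                             # 250000 rows
nrow(unique(datas[, c("id1", "id2")]))  # roughly 230000 distinct (id1, id2) groups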

I couldn't find a straightforward way to subset by groups with data.table, so I asked this question: Filter rows by groups with data.table

The suggestion was to use .SD:

datas.dt[, .SD[datetime == max(datetime)], by = c("id1", "id2")]

But I have two problems: it works with Date but not with POSIXct ("Error in UseMethod("as.data.table") : no applicable method for 'as.data.table' applied to an object of class "c('POSIXct', 'POSIXt')""), and it is very slow. For example with Dates:

> system.time({
+   datas.dt[, .SD[as.Date(datetime) == max(as.Date(datetime))], by = c("id1", "id2")]
+ })
        user      system     elapsed 
      207.03        0.00      207.48 

So I found another, much faster way to achieve this with data.table (while keeping the datetimes):

Functions

f.dplyr <- function(x) x %>% group_by(id1, id2) %>% filter(datetime == max(datetime))
f.dt.i <- function(x) x[x[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
f.dt <- function(x) x[x[, datetime == max(datetime), by = c("id1", "id2")]$V1]

But I thought data.table would be much faster; the time difference with dplyr isn't significant.

Microbenchmark

mbm <- microbenchmark(
  dplyr = res1 <- f.dplyr(datas.tbl), 
  data.table.I = res2 <- f.dt.i(datas.dt), 
  data.table = res3 <- f.dt(datas.dt), 
  times = 50L)

Unit: seconds
         expr      min       lq     mean   median       uq      max neval
        dplyr 31.84249 32.24055 32.59046 32.61311 32.88703 33.54226    50
 data.table.I 30.02831 30.94621 31.19660 31.17820 31.42888 32.16521    50
   data.table 30.28923 30.84212 31.09749 31.04851 31.40432 31.96351    50


Am I missing or misusing something with data.table? Do you have any ideas to speed up this computation?

Any help would be highly appreciated! Thanks.


Edit: some details about the system and package versions used for the microbenchmark. (The computer isn't a powerhouse: an i5 with 12 GB of RAM.)

System

sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
  [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] readr_0.1.0          ggplot2_1.0.1        microbenchmark_1.4-2
[4] data.table_1.9.4     dplyr_0.4.1          plyr_1.8.2          

loaded via a namespace (and not attached):
  [1] assertthat_0.1   chron_2.3-45     colorspace_1.2-6 DBI_0.3.1       
[5] digest_0.6.8     grid_3.1.3       gtable_0.1.2     lazyeval_0.1.10 
[9] magrittr_1.5     MASS_7.3-39      munsell_0.4.2    parallel_3.1.3  
[13] proto_0.3-10     Rcpp_0.11.5      reshape2_1.4.1   scales_0.2.4    
[17] stringi_0.4-1    stringr_0.6.2    tools_3.1.3 

> packageVersion("data.table")
[1] ‘1.9.4’
> packageVersion("dplyr")
[1] ‘0.4.1’
Julien Navarre
  • Do you want to get all the values that equal the max, or just the first one, like `which.max` returns? Also, `datas.dt[, .SD[as.Date(datetime) == max(as.Date(datetime))], by = c("id1", "id2")]` is bad practice. You should convert `date` to the `IDate` class before subsetting. – David Arenburg Aug 06 '15 at 10:11
  • Just for fun, can you add `x %>% group_by(id1, id2) %>% slice(which(datetime == max(datetime)))` to your comparison? – talat Aug 06 '15 at 10:15
  • Also, `datas.dt[, datetime := as.IDate(datetime)] ; system.time(datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])` runs in only 5 seconds compared to 200 when using `.SD`, so I find it hard to believe your benchmarks. – David Arenburg Aug 06 '15 at 10:18
  • @docendodiscimus this runs in 228 seconds on my machine, compared to `datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]`, which runs in only 5 seconds – David Arenburg Aug 06 '15 at 10:24
  • 1
    @DavidArenburg, congrats, though that's not the comparison I was aiming at... anyway, I was just asking out of curiosity. – talat Aug 06 '15 at 10:32
  • 2
    @docendodiscimus I wasn't bragging or anything, so I'm not sure what you are congratulating me for. The OP is looking for a `data.table` solution because he assumes it will be faster than `dplyr`; this is why I compared your proposal against `data.table`, in case his assumption is wrong. – David Arenburg Aug 06 '15 at 10:33
  • @DavidArenburg I want to use `max` to keep all the values equal to the max, not `which.max`. And indeed, coercing to IDate before the computation is much faster (~10 sec on my machine). But the benchmarks were run with exactly the same code as in the question, so you can trust them... maybe the next question I should ask is "how to speed up my computer" :) – Julien Navarre Aug 06 '15 at 17:05

2 Answers


Great question!

I'll use df and dt as the object names, for easy/quick typing.

df = datas.tbl
dt = datas.dt

Comparison at -O3 level optimisation:

First, here's the timing on my system on the current CRAN version of dplyr and devel version of data.table. The devel version of dplyr seems to suffer from performance regressions (and is being fixed by Romain).

system.time(df %>% group_by(id1, id2) %>% filter(datetime == max(datetime)))
#  25.291   0.128  25.610 

system.time(dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
#  17.191   0.075  17.349 

I ran this quite a few times and the timings didn't seem to change. However, I compile all packages with the -O3 optimisation flag (by setting ~/.R/Makevars appropriately), and I've observed that data.table's performance improves more at -O3 than that of the other packages I've compared it with.
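
For reference, one possible ~/.R/Makevars entry that requests -O3 when packages are compiled from source (treat this as a sketch; the exact flags depend on your compiler setup):

CFLAGS   = -O3
CXXFLAGS = -O3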

Grouping speed comparison

Second, it is important to understand the reason for the slowness. First, let's compare the time to just group the data:

system.time(group_by(df, id1, id2))
#   0.303   0.007   0.311 
system.time(data.table:::forderv(dt, by = c("id1", "id2"), retGrp = TRUE))
#   0.002   0.000   0.002 

Even though there are a total of 250,000 rows, your data is only about 38 MB in size. At this size, it's unlikely you'd see a noticeable difference in grouping speed.
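
(If you want to check the in-memory size yourself, something along these lines should do; the ~38 MB figure is for the data as generated in the question.)

print(object.size(df), units = "Mb")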

data.table's grouping is >100x faster here, so grouping is clearly not the reason for the slowness...

Why is it slow?

So what's the reason? Let's turn on the datatable.verbose option and check again:

options(datatable.verbose = TRUE)
dt[dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
# Detected that j uses these columns: datetime 
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as '.I[datetime == max(datetime)]'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Starting dogroups ... 
#   memcpy contiguous groups took 0.097s for 230000 groups
#   eval(j) took 17.129s for 230000 calls
# done dogroups in 17.597 secs

So eval(j) alone took ~97% of the time! The expression we've provided in j is evaluated once for each group. Since you have 230,000 groups, and there's a penalty for each eval() call, that adds up.
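
To get a rough feel for that per-call overhead on its own, independent of data.table, here is a hypothetical illustration: evaluating even a trivial quoted expression 230,000 times from R already takes noticeable time.

e <- quote(1 + 1)
system.time(for (i in 1:230000) eval(e))  # the eval() overhead alone adds up over 230K calls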

Avoiding the eval() penalty

Since we're aware of this penalty, we've gone ahead and started implementing internal versions of some commonly used functions: sum, mean, min, max. This will/should be expanded to as many other functions as possible (when we find time).

So, let's try computing the time for just obtaining max(datetime) first:

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
# Detected that j uses these columns: datetime 
# Finding groups (bysameorder=TRUE) ... done in 0.002secs. bysameorder=TRUE and o__ is length 0
# lapply optimization is on, j unchanged as 'list(max(datetime))'
# GForce optimized j to 'list(gmax(datetime))'

And it's instant. Why? Because max() gets internally optimised to gmax() and there's no eval() call for each of the 230K groups.

So why isn't datetime == max(datetime) instant? Because it's more complicated to parse such expressions and optimise internally, and we have not gotten to it yet.

Workaround

So now that we know the issue, and a way to get around it, let's use it.

dt.agg = dt[, .(datetime = max(datetime)), by = .(id1, id2)]
dt[dt.agg, on = c("id1", "id2", "datetime")] # v1.9.5+

This takes ~0.14 seconds on my Mac.

Note that this is only fast because the expression gets optimised to gmax(). Compare it with:

dt[, .(datetime = base::max(datetime)), by = .(id1, id2)]

I agree optimising more complicated expressions to avoid the eval() penalty would be the ideal solution, but we are not there yet.
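
A quick way to see that gap for yourself is to time both forms (just a sketch; exact timings will vary with your machine and data.table version):

system.time(dt[, .(datetime = max(datetime)), by = .(id1, id2)])         # optimised to gmax()
system.time(dt[, .(datetime = base::max(datetime)), by = .(id1, id2)])   # evaluated once per group, no GForce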

Arun
  • 1
    Thanks for this enlightening answer. You gave me a solution that divides the execution time by 100, and also helped me a lot to understand the bottleneck in this computation! Thanks. – Julien Navarre Aug 06 '15 at 16:53

How about summarizing the data.table and joining back to the original data?

system.time({
  datas1 <- datas.dt[, list(datetime=max(datetime)), by = c("id1", "id2")] #summarize the data
  setkey(datas1, id1, id2, datetime)
  setkey(datas.dt, id1, id2, datetime)
  datas2 <- datas.dt[datas1]
})
#  user  system elapsed 
# 0.083   0.000   0.084 

which correctly filters the data

system.time(dat1 <- datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
#   user  system elapsed 
# 23.226   0.000  23.256 
all.equal(dat1, datas2)
# [1] TRUE

Addendum

The setkey calls are superfluous if you are using the devel version of data.table (thanks to @akrun for the pointer):

system.time({
  datas1 <- datas.dt[, list(datetime=max(datetime)), by = c("id1", "id2")] #summarize the data
  datas2 <- datas.dt[datas1, on=c('id1', 'id2', 'datetime')]
})
Khashaa
  • 1
    In the devel version you don't need the `setkey`. `datas.dt[datas1, on=c('id1', 'id2')]` should work, though I haven't tested the timings. – akrun Aug 06 '15 at 10:25
  • @akrun, thanks. I am blind to the nuts and bolts of `data.table`. – Khashaa Aug 06 '15 at 10:30
  • You should probably keep both versions, as your edit works only on the devel version. – David Arenburg Aug 06 '15 at 10:30
  • When I tested, the `setkey` version was slightly faster per `system.time`. But it may be random variation too. – akrun Aug 06 '15 at 10:33
  • 2
    @akrun, yes, there's an open issue [on GH](https://github.com/Rdatatable/data.table/issues/1232). This is another reason why I think we should keep both options. Btw, nice solution Khashaa, you may have just redefined the canonical solution for such tasks instead of [this](http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table) – David Arenburg Aug 06 '15 at 10:35
  • @DavidArenburg Thanks for the link. Good to know that it is already discussed. – akrun Aug 06 '15 at 10:38
  • @DavidArenburg I thought this approach was common knowledge among data.table users, as a counterpart to `dplyr::filter`. – Khashaa Aug 06 '15 at 10:55
  • I don't think it's that common. The common approaches are `.SD` and `.I`. Though I think I saw @Arun doing something similar, I'm not sure. You can see from the [linked answer](http://stackoverflow.com/questions/16573995/subset-by-group-with-data-table) that the `.I` approach is widely accepted. – David Arenburg Aug 06 '15 at 10:59
  • @DavidArenburg As a novice `data.table` user, I never figured out how `.I` worked, partly because it is nowhere to be found at https://github.com/Rdatatable/data.table/wiki/Getting-started. – Khashaa Aug 06 '15 at 11:13
  • 1
    @Khashaa take a look at [this answer](http://stackoverflow.com/questions/31804177/replacing-the-last-value-within-groups-with-different-values/31835541#31835541); I think I've explained it pretty well. Though given Arun's awesome answer, I'm starting to wonder whether this solution works better for *all* functions rather than just `sum`, `mean`, `min` and `max` – David Arenburg Aug 06 '15 at 11:17
  • @DavidArenburg, Thanks for the link and the detailed explanation. From Arun's answer, I see now that the crux of the matter is `gmax`. – Khashaa Aug 06 '15 at 11:29
  • Or in other words, it's all about `eval` (the desired operation is unknown and needs to be re-evaluated in each iteration). That is the main difference between, say, `colMeans(df)` and `lapply(df, mean)`, as discussed at greater length [here](http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized) – David Arenburg Aug 06 '15 at 11:32