
Another novice question regarding big data. I'm working with a large dataset (3.5M rows) of time series data. I want to create a data.table with a column giving the first time each unique identifier appears.

df is a data.table, df$timestamp is a POSIXct date-time, and df$id is the unique numeric identifier. I'm using the following code:

# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by=list(id)]  # find the first timestamp for each unique ID

Here's the catch: I'm aggregating over 80K unique IDs, and R is choking. Is there anything I can do to optimize my approach?

– roody

4 Answers


Here's a little benchmark to show which behaviour takes a LOT of time:

require(data.table)
dt <- data.table(sample(seq(as.Date("2012-01-01"), as.Date("2013-12-31"),
                            by="days"), 1e5, replace=TRUE),  # unnamed first column becomes V1
                 val=sample(1e4, 1e5, replace=TRUE))

FUN1 <- function() {
    out <- dt[, min(dt$V1), by=val]  # min of entire V1 for each group i.e. wrong
}

FUN2 <- function() {
    out <- dt[, min(V1), by=val]     # min of V1 within group as intended
}

require(rbenchmark)
> benchmark(FUN1(), FUN2(), replications = 1, order="elapsed")
#     test replications elapsed relative user.self sys.self user.child sys.child
# 2 FUN2()            1   0.271    1.000     0.242    0.002          0         0
# 1 FUN1()            1  38.378  141.616    32.584    4.153          0         0

It is very clear that FUN2() is blazing fast. Remember, in neither case was a key set.
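
For completeness, a keyed variant could be added to the same benchmark along these lines (a sketch, not part of Arun's original test; as Justin notes in the comments below, its timing comes out essentially identical to FUN2()):

dtk <- copy(dt)                      # keep the unkeyed table intact
setkey(dtk, val)                     # key (and sort) by the grouping column

FUN3 <- function() {
    out <- dtk[, min(V1), by=val]    # same correct grouped min, on a keyed table
}
# e.g. benchmark(FUN2(), FUN3(), replications = 1, order = "elapsed")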

– Arun
  • Touché! For posterity, benchmarking a function with `setkey` before the operation is identical to your `FUN2`. – Justin Jan 29 '13 at 19:57
  • Not sure the results of FUN1 and FUN2 are the same, though. Isn't FUN1 finding and refinding the same global minimum over the whole of V1 each time? But FUN2 is getting the right answer (minimum within each group). – Matt Dowle Jan 30 '13 at 11:39
  • @MatthewDowle, yes, spot on: they are not the same. But this is what the OP was running earlier, thinking he was doing it right. In addition to the results being incorrect, this is also the reason for the speed he was lacking, which is what I guess I should have said. – Arun Jan 30 '13 at 12:11

As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey.

df[, min(timestamp), by=id]

While 80K unique IDs sounds like a lot, using the key feature of data.table can make it a manageable prospect.

setkey(df, id)

Then process as before. For what it's worth, you can often take advantage of a pleasant side effect of keys: sorting.

set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))

    x y
 1:  3 a
 2:  4 b
 3:  5 a
 4:  7 b
 5:  2 a
 6:  8 b
 7:  9 a
 8:  6 b
 9: 10 a
10:  1 b

setkey(dat, y, x)

     x y
 1:  2 a
 2:  3 a
 3:  5 a
 4:  9 a
 5: 10 a
 6:  1 b
 7:  4 b
 8:  6 b
 9:  7 b
10:  8 b

Then the min, or another more complex function, is just a subset operation:

dat[, .SD[1], by=y]
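
Applied to the OP's data, the same idea might look like this (a sketch, assuming the column names from the question; note Matt Dowle's comment below that .SD[1] can in principle be slower than a direct min() on a double column):

setkey(df, id, timestamp)        # within each id, rows are now sorted by timestamp
sub_df <- df[, .SD[1], by=id]    # the first row per id holds its earliest timestamp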
– Justin
  • (+1) hmm.. I am confused about section 3.2 of the FAQ [**here**](http://datatable.r-forge.r-project.org/datatable-faq.pdf) now. Any ideas? – Arun Jan 29 '13 at 19:45
  • While `data.table` has rapid sorting when using ad-hoc keys, it isn't as fast as a keyed table, especially in this instance where the key sorts first. Mr. Dowle or another master of `data.table` can expand on this further. – Justin Jan 29 '13 at 19:49
  • I just ran it without setting a key on a `data.table`, and using `df$V1` takes forever. Please hold on; I'll write a post once the benchmarking is over. – Arun Jan 29 '13 at 19:51
  • Check out the post now; you should be able to reproduce it. – Arun Jan 29 '13 at 19:54
  • Justin, see the edits on the other answers. `.SD[1]` is another way to avoid the `min()` on `POSIXct`, which is the reason you may have seen that as faster. But in principle I'd expect it to be slower than `df[, min(colDoubleType), by=id]`, because the latter avoids `.SD` (which, if used, currently has to be created in full for each group by data.table). – Matt Dowle Jan 30 '13 at 12:17

As a complement to Arun's answer, here's a benchmark on a data set of similar size to the OP's (3.5M rows, 80K IDs) showing that keyed and unkeyed aggregation are not too different. So the speedup may be due to avoiding the $ operator.

set.seed(10)
eg <- function(x) data.table(
    id = sample(8e4, x, replace=TRUE),
    timestamp = as.POSIXct(runif(x, min = ISOdatetime(2013,1,1,0,0,0) - 60*60*24*30,
                                     max = ISOdatetime(2013,1,1,0,0,0)),
                           origin = "1970-01-01"))
df <- eg(3.5e6)
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
    unkeyed = df[,min(timestamp),by=id][,table(weekdays(V1))]
    ,keyed = dfk[,min(timestamp),by=id][,table(weekdays(V1))]
    ,times=5
)
#Unit: seconds
#     expr      min       lq   median       uq      max
#1   keyed 7.330195 7.381879 7.476096 7.486394 7.690694
#2 unkeyed 7.882838 7.888880 7.924962 7.927297 7.931368

Edit from Matthew.

Actually, the time above is almost entirely down to the POSIXct type.

> system.time(dfk[,min(timestamp),by=id])
   user  system elapsed 
   8.71    0.02    8.72 
> dfk[,timestamp:=as.double(timestamp)]  # discard POSIXct type to demonstrate
> system.time(dfk[,min(timestamp),by=id])
   user  system elapsed 
   0.14    0.02    0.15     # that's more like normal data.table speed

Reverting to POSIXct and using Rprof shows that 97% of the time is inside min() for that type (i.e. nothing to do with data.table).
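
A profile like this can be collected along the following lines (a sketch; the exact call used isn't shown in the post), producing the summary below:

dfk[, timestamp := as.POSIXct(timestamp, origin="1970-01-01")]  # revert to POSIXct
Rprof("min_posixct.out")                          # start profiling to a file (hypothetical filename)
system.time(dfk[, min(timestamp), by=id])         # the slow keyed aggregation on POSIXct
Rprof(NULL)                                       # stop profiling
summaryRprof("min_posixct.out")$by.total          # totals by function, as below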

$by.total
                total.time total.pct self.time self.pct
system.time           8.70    100.00      0.00     0.00
[.data.table          8.64     99.31      0.12     1.38
[                     8.64     99.31      0.00     0.00
min                   8.46     97.24      0.46     5.29
Summary.POSIXct       8.00     91.95      0.86     9.89
do.call               5.86     67.36      0.26     2.99
check_tzones          5.46     62.76      0.20     2.30
unique                5.26     60.46      2.04    23.45
sapply                3.74     42.99      0.46     5.29
simplify2array        2.38     27.36      0.16     1.84
NextMethod            1.28     14.71      1.28    14.71
unique.default        1.10     12.64      0.92    10.57
lapply                1.10     12.64      0.76     8.74
unlist                0.60      6.90      0.28     3.22
FUN                   0.24      2.76      0.24     2.76
match.fun             0.22      2.53      0.22     2.53
is.factor             0.18      2.07      0.18     2.07
parent.frame          0.14      1.61      0.14     1.61
gc                    0.06      0.69      0.06     0.69
duplist               0.04      0.46      0.04     0.46
[.POSIXct             0.02      0.23      0.02     0.23

Notice the object size of dfk:

> object.size(dfk)
40.1 Mb

Nothing should take 7 seconds in data.table at this tiny size! It needs to be 100 times larger (4GB), with a non-flawed j, before you can see the difference between keyed by and ad hoc by.
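
For illustration, a scaled-up test might look roughly like this (a sketch only, not part of the original edit and not run here; it needs several GB of RAM):

big  <- eg(3.5e8)                            # ~100x the size used above
bigk <- copy(big)
setkey(bigk, id)
big[,  timestamp := as.double(timestamp)]    # a non-flawed j: min() on double, not POSIXct
bigk[, timestamp := as.double(timestamp)]
system.time(big[,  min(timestamp), by=id])   # ad hoc by
system.time(bigk[, min(timestamp), by=id])   # keyed by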

Edit from Blue Magister:

Taking Matthew Dowle's edit into account, there is indeed a difference between the keyed and unkeyed commands once the POSIXct conversion is moved outside the aggregation:

df <- eg(3.5e6)
df[,timestamp := as.double(timestamp)]
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
    unkeyed = df[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
    ,keyed = dfk[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
    ,times=10
)
#Unit: milliseconds
#     expr      min       lq   median       uq       max
#1   keyed 340.3177 346.8308 348.7150 354.7337  358.1348
#2 unkeyed 886.1687 888.7061 901.1527 945.6190 1036.3326
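
If the goal is to end up with POSIXct values in the result while keeping the aggregation fast, one option (a sketch, not from the answer; first_seen is a hypothetical column name) is to aggregate on the doubles and convert only the 80K group minima back at the end:

res <- dfk[, min(timestamp), by=id]                        # fast: timestamp is stored as double here
res[, first_seen := as.POSIXct(V1, origin="1970-01-01")]   # convert just the 80K results back to POSIXct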
– Blue Magister
  • @MatthewDowle Thanks for clearing it up. I adjusted my benchmark to move the `POSIXct` commands outside the aggregation step. I see a huge improvement compared to before, and a 2x speed improvement of keyed `min` over unkeyed `min`. – Blue Magister Jan 30 '13 at 17:01

Here's a data.table version

dtBy <- function(dt)
    dt[, min(timestamp), by=id]

Going a little old-school, here's a function that returns the minimum of each group

minBy <- function(x, by) {
    o <- order(x)                          # order of values, ascending
    by <- by[o]                            # group labels in that order
    idx <- !duplicated(by)                 # first occurrence of each group is its minimum
    data.frame(by=by[idx], x=x[o][idx])
}

and it seems to have reasonable performance on Blue Magister's sample data:

> system.time(res0 <- dtBy(dt))
   user  system elapsed 
 11.165   0.216  11.894 
> system.time(res1 <- minBy(dt$timestamp, dt$id))
   user  system elapsed 
  4.784   0.036   4.836 
> all.equal(res0[order(res0$id),], res1[order(res1$by),],
+           check.attributes=FALSE)
[1] TRUE

Edit from Matthew (mostly)

Yes, this is because minBy avoids calling min() on the POSIXct type, which is the terribly slow part when repeated per group. It has nothing to do with data.table.

Using the dfk from Blue Magister's answer:

dfk[,timestamp:=as.double(timestamp)]  # discard POSIXct type to demonstrate

system.time(res0 <- dtBy(dfk))
   user  system elapsed 
   0.16    0.02    0.17 
system.time(res1 <- minBy(dfk$timestamp, dfk$id))
   user  system elapsed 
   4.87    0.04    4.92 

Now the old-school method looks very slow in comparison to data.table. All the time was being spent in min() on the POSIXct type. See the Rprof output in the edit to Blue Magister's answer above.

– Martin Morgan