
I have nearly 10M rows and I want to select only the first three rows from each group.

I use

data[x == 1 | y > -6, .SD[1:3], by = z]

I need to get as a result:

[expected output shown as an image in the original post]

but it is very slow, and the 10M rows are only the training set. Any ideas on how to optimize this? Thanks in advance.

  • 1
    Hi, welcome to SO. Please consider reading up on [ask] and how to produce a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It makes it easier for others to help you. – Heroka Feb 12 '16 at 14:34
  • What version of `data.table` are you running? Have you read the [binary search](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-keys-fast-subset.html) vignette? – MichaelChirico Feb 12 '16 at 14:34
  • @MichaelChirico data.table 1.9.6 – Vitaliy Radchenko Feb 12 '16 at 14:35
  • [Update to the development version](https://github.com/Rdatatable/data.table/wiki/Installation) -- there's recently been some `GForce` optimization of the operation you're doing. – MichaelChirico Feb 12 '16 at 14:36
  • Are you aware of this answer? http://stackoverflow.com/a/16574176/1191259 Also, I guess you want `seq_len(min(.N,3))` in case a z-group doesn't have three rows. – Frank Feb 12 '16 at 14:38
  • @Frank first I order data.table by z asc and y decs. I have not mentioned it in my question. Sorry. How could I add other columns to the output? – Vitaliy Radchenko Feb 12 '16 at 14:48
  • @VitaliyRadchenko I'm not sure I understand what you're asking. You could edit your question text to clarify. – Frank Feb 12 '16 at 14:55
  • 1
    Try `data[data[x == 1L | y > -6L, .I[1:3], by=z]$V1]` until this case is optimised. – Arun Feb 12 '16 at 14:58
  • @Frank I've added an image. – Vitaliy Radchenko Feb 12 '16 at 15:10
  • @Arun Perfect! Thank you! – Vitaliy Radchenko Feb 12 '16 at 15:27
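The pattern from Arun's comment can be sketched on a small made-up table (the column names follow the question; the toy data is hypothetical, and `seq_len(min(.N, 3L))` is Frank's guard for groups with fewer than three matching rows):

```r
library(data.table)

# Hypothetical toy data standing in for the 10M-row table
set.seed(42)
dt <- data.table(
  z = rep(1:3, each = 5),
  x = sample(0:1, 15, replace = TRUE),
  y = rnorm(15)
)

# Compute matching row numbers per group with .I, then subset once;
# this avoids materialising .SD for every group.
idx <- dt[x == 1L | y > -6, .I[seq_len(min(.N, 3L))], by = z]$V1
res <- dt[idx]
res[, .N, by = z]   # at most 3 rows per group
```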

1 Answer


Your example is not reproducible. I recommend reading up on how to ask R questions on SO, to help make the R tag a solid knowledge base rather than a fast but more ephemeral Q&A.

Sorry for the off-topic remark.

You can potentially get a significant speed-up by using a data.table index. An index currently supports filtering on a single variable at a time, so in your case it would look like:

set2key(data, x)                 # build a secondary key (index) on x
ix = data[x == 1, which = TRUE]  # indexed subset: returns matching row numbers
iy = data[y > -6, which = TRUE]  # this will not use the index (yet)!
data[union(ix, iy), ...]         # then apply your per-group selection to those rows

Use `options(datatable.verbose = TRUE)` to confirm that the indexes are actually being used.

The code in the question is not reproducible because no sample data is provided, so I cannot offer a benchmark. One would be valuable here, because the potential speed-up depends on the data and could even turn into a slowdown.
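As a hedged end-to-end sketch on made-up data (the table shape is invented; note that `set2key()` was the data.table 1.9.6 API, and in current versions the equivalent is `setindex()`):

```r
library(data.table)

set.seed(1)
dt <- data.table(z = rep(1:1000, each = 100),
                 x = sample(0:3, 1e5, replace = TRUE),
                 y = rnorm(1e5))

setindex(dt, x)                    # modern equivalent of set2key(dt, x)
ix <- dt[x == 1, which = TRUE]     # indexed subset: matching row numbers
iy <- dt[y > -6, which = TRUE]     # plain vector scan (not indexed)
rows <- union(ix, iy)

# then take the first (up to) three rows per group among the kept rows
res <- dt[rows, .SD[seq_len(min(.N, 3L))], by = z]
```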

jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • The best answer is `data[data[x == 1L | y > -6L, .I[1:3], by=z]$V1]`. Thank you. – Vitaliy Radchenko Feb 12 '16 at 15:36
  • @VitaliyRadchenko It depends on your data. The package does special optimization when you use a single inequality test at a time (as Jan does in this answer), so it may be faster in some cases. – Frank Feb 12 '16 at 15:38
  • @VitaliyRadchenko Generally it depends on the cardinality of the data. – jangorecki Feb 12 '16 at 15:41