
I have nearly 10M rows and I want to select only the first three rows from each group.

I use

data[x == 1 | y > -6, .SD[1:3], by = z]

I need to get as a result:

[expected output shown as an image in the original post]

but it is very slow, and the 10M rows are only the training set. Any ideas on how to optimize this? Thanks in advance.

  • 1
    Hi, welcome to SO. Please consider reading up on [ask] and how to produce a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It makes it easier for others to help you. – Heroka Feb 12 '16 at 14:34
  • What version of `data.table` are you running? Have you read the [binary search](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-keys-fast-subset.html) vignette? – MichaelChirico Feb 12 '16 at 14:34
  • @MichaelChirico data.table 1.9.6 – Vitaliy Radchenko Feb 12 '16 at 14:35
  • [Update to the development version](https://github.com/Rdatatable/data.table/wiki/Installation) -- there's recently been some `GForce` optimization of the operation you're doing. – MichaelChirico Feb 12 '16 at 14:36
  • Are you aware of this answer? http://stackoverflow.com/a/16574176/1191259 Also, I guess you want `seq_len(min(.N,3))` in case a z-group doesn't have three rows. – Frank Feb 12 '16 at 14:38
  • @Frank first I order data.table by z asc and y decs. I have not mentioned it in my question. Sorry. How could I add other columns to the output? – Vitaliy Radchenko Feb 12 '16 at 14:48
  • @VitaliyRadchenko I'm not sure I understand what you're asking. You could edit your question text to clarify. – Frank Feb 12 '16 at 14:55
  • 1
    Try `data[data[x == 1L | y > -6L, .I[1:3], by=z]$V1]` until this case is optimised. – Arun Feb 12 '16 at 14:58
  • @Frank I've added an image. – Vitaliy Radchenko Feb 12 '16 at 15:10
  • @Arun Perfect! Thank you! – Vitaliy Radchenko Feb 12 '16 at 15:27
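The pattern from Arun's comment can be sketched on a small made-up table (the column names follow the question; the toy data is hypothetical, and `seq_len(min(.N, 3L))` is Frank's guard for groups with fewer than three matching rows):

```r
library(data.table)

# Hypothetical toy data standing in for the 10M-row table
set.seed(42)
dt <- data.table(
  z = rep(1:3, each = 5),
  x = sample(0:1, 15, replace = TRUE),
  y = rnorm(15)
)

# Compute matching row numbers per group with .I, then subset once;
# this avoids materialising .SD for every group.
idx <- dt[x == 1L | y > -6, .I[seq_len(min(.N, 3L))], by = z]$V1
res <- dt[idx]
res[, .N, by = z]   # at most 3 rows per group
```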

1 Answer


Your example is not reproducible. I recommend reading up on how to ask R questions on SO, to help make the R tag a solid knowledge base rather than a fast but more ephemeral Q&A.

Sorry for the off-topic remark.

You can potentially get a significant speed-up by using a data.table index. An index currently supports filtering on a single variable at a time, so in your case it would look like:

set2key(data, x)                 # build a secondary key (index) on x
ix = data[x == 1, which = TRUE]  # indexed subset: returns matching row numbers
iy = data[y > -6, which = TRUE]  # this will not use the index (yet)!
data[union(ix, iy), ...]         # then apply your per-group selection to those rows

Use `options(datatable.verbose = TRUE)` to confirm that the indexes are actually being used.

The code in the question is not reproducible because no sample data is provided, so I cannot offer a benchmark. One would be valuable here, because the potential speed-up depends on the data and could even turn into a slowdown.
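As a hedged end-to-end sketch on made-up data (the table shape is invented; note that `set2key()` was the data.table 1.9.6 API, and in current versions the equivalent is `setindex()`):

```r
library(data.table)

set.seed(1)
dt <- data.table(z = rep(1:1000, each = 100),
                 x = sample(0:3, 1e5, replace = TRUE),
                 y = rnorm(1e5))

setindex(dt, x)                    # modern equivalent of set2key(dt, x)
ix <- dt[x == 1, which = TRUE]     # indexed subset: matching row numbers
iy <- dt[y > -6, which = TRUE]     # plain vector scan (not indexed)
rows <- union(ix, iy)

# then take the first (up to) three rows per group among the kept rows
res <- dt[rows, .SD[seq_len(min(.N, 3L))], by = z]
```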

jangorecki
  • 16,384
  • 4
  • 79
  • 160
  • The best answer is `data[data[x == 1L | y > -6L, .I[1:3], by=z]$V1]`. Thank you. – Vitaliy Radchenko Feb 12 '16 at 15:36
  • @VitaliyRadchenko It depends on your data. The package does special optimization when you use a single inequality test at a time (as Jan does in this answer), so it may be faster in some cases. – Frank Feb 12 '16 at 15:38
  • @VitaliyRadchenko Generally it depends on the cardinality of the data. – jangorecki Feb 12 '16 at 15:41