1

I have a data.table that I want to pull out the last 10,000 lines on a per- group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so am obviously not understanding the full picture. I have concerns about each of my methods.

The data is structured in such a way that I have a set of columns that I wish to group by, where I want to grab the entries corresponding with the last 10,000 POSIXct timestamps (if that many exist...otherwise return all). Entries are defined as unique by the combination of the grouping columns along with the timestamps, even though there are several other data columns. In the below example, my timestamp column is in ts and keycol1 and keycol2 are the fields that I'm grouping on. DT has 2,809,108 entries.

setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]

returns 1,181,256 entries. No warning or error is issued. My concern is what is happening when .N-10000 for the group is <1? When I do a DT[-10:10] I receive the following error.

Error in [.data.table(DT, -10:10) : Item 1 of i is -10 and item 12 is 1. Cannot mix positives and negatives.

leading me to believe that the .I[.N-10000:.N] may not be working as intended.

If I instead try to sort the timestamps backward and then use the strategy described by Jaap in his answer to this question

DT[DT[order(-ts),.I[1:10000], by=c("keycol1","keycol2")]$V1],nomatch=NULL]

returns 3,810,000 entries, some of which are all NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution." However, do the values for .I correspond with the row numbers of the original DT or of the reverse-sorted (within group) DT? So, does the outer selection return the true matches, or will it actually result in the first 10000 entries per group, rather than the last 10000?

okay, to resolve this question can I have the key itself work backward?

setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)

NULL

setkey(DT,keycol1,keycol2,-ts)

Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: -ts

That'd be a no then.

Can this be resolved by using .SD rather than .I?

DT[
  DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
  nomatch=0, on=c("keycol1","keycol2","ts")
]

returns 1,972,166 entries. Although I'm fairly confident that these entries are the ones I want, it also results in a duplication of columns not part of the key or timestamp (as i.A, i.B, etc). I think these are the same entries as the .I[1:10000] example with the order(-ts), as if I store each, delete the extra columns in the .SD method, then do a setkey(resultA, keycol1,keycol2,ts) for each then do identical(resultA,resultB) it returns

TRUE

Related threads:

Throw away first and last n rows

data.table - select first n rows within group

How to extract the first n rows per group?

mpag
  • 531
  • 7
  • 19
  • 3
    Have you tried `DT[, tail(.SD, 10e3), .(keycol1, keycol2)]`? Your `.I` approach is missing parentheses, see e.g. `1 - 4:5` vs `(1 - 4):5`, but even with parentheses you would have to deal with possible negatives, whereas `tail` takes care of that automatially. If the `tail(.SD` is slow you could try `DT[DT[, tail(.I, 10e3), .(keycol1, keycol2)]$V1]` – IceCreamToucan Jul 18 '19 at 20:12
  • thanks doubly. Good point about parens - I haven't recently looked at order of ops documentation, but you're almost definitely correct there. And using tail seems the most logical way of handling this – mpag Jul 19 '19 at 00:08

0 Answers0