8

I am trying to exclude some rows from a datatable based on, let's say, days and month - excluding for example summer holidays, that always begin for example 15th of June and end the 15th of next month. I can extract those days based on Date, but as as.Date function is awfully slow to operate with, I have separate integer columns for Month and Day and I want to do it using only them.

It is easy to select the given entries by

DT[Month==6][Day>=15]
DT[Month==7][Day<=15]

Is there any way how to make "difference" of the two data.tables (the original ones and the ones I selected). (Why not subset? Maybe I am missing something simple, but I don't want to exclude days like 10/6, 31/7.)

I am aware of a way to do it with join, but only day by day

setkey(DT, Month, Day)
DT[-DT[J(Month,Day), which= TRUE]]

Can anyone help how to solve it in more general way?

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
krhlk
  • 1,574
  • 2
  • 15
  • 27
  • 1
    Good question. You'll likely be interested in the answers and discussion that resulted from the same question when it was asked here: http://stackoverflow.com/questions/12319083/how-do-i-do-a-negative-nomatch-inverse-search-in-data-table/12319774#12319774 – Josh O'Brien Oct 22 '12 at 18:39
  • And there was [this answer](http://stackoverflow.com/a/13013289/1281189) just a little while ago – BenBarnes Oct 22 '12 at 18:50
  • Hmm, yes, that is interesting, but for me the problem is, that I cannot do negative match in this way, because I try to do subset based on two columns... :-/ The problem I need to do complement to intersection... – krhlk Oct 22 '12 at 18:51
  • @tomaskrehlik Why not `subset`? – Matthew Plourde Oct 22 '12 at 19:06

2 Answers2

3

Great question. I've edited the question title to match the question.

A simple approach avoiding as.Date which reads nicely :

DT[!(Month*100L+Day) %between% c(0615L,0715L)]

That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear :

DT[,mmdd:=Month*100L+Day]
from = DT[J(0615),mult="first",which=TRUE]
to = DT[J(0715),mult="first",which=TRUE]
DT[-(from:to)]

That's a bit long and error prone because it's DIY. So one idea is that a list column in an i table would represent a range query (FR#203, like a binary search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list column range query to do exactly what you asked :

setkey(DT,mmdd)
DT[-J(list(0615,0715))]

That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • And what about "negative select"? I mean something that would have syntax something like `!DT[Month==6][Day>=15]` and would produce the original dataset without the selected entries? – krhlk Oct 24 '12 at 10:49
  • @tomaskrehlik Huh? That's what my answer answers, doesn't it? Are you defining a new term _negative-select_ different to _not-join_? Another way is just : `DT[!(Month==6 & Day>=15)]`. – Matt Dowle Oct 24 '12 at 10:56
  • yes, sorry, the answer gets the job done, of course. The comment was a mere suggestion for feature that I think would read nicely and be extremely easy to understand. – krhlk Oct 24 '12 at 11:05
  • @tomaskrehlik Is your suggestion syntax like `!DT[Month==6][Day>=15]`? This can't work because `!` of a dataset makes no sense. The compound `[Month==6][Day>=15]` is missing the raison d'etre for everything being inside one `[...]` if possible. No `j` in the first `[Month==6]` means all columns will be subset, to be passed to the second `[Day>=15]`. `data.table` wants you to combine things together and stick the _select_ query (or _negative-select_) as `i` inside `[...]`, not outside. Then it can make optimizations under the hood for you. – Matt Dowle Oct 24 '12 at 11:22
1

Based on the answer here, you might try something like

# Sample data
DT <- data.table(Month = sample(c(1,3:12), 100, replace = TRUE),
  Day = sample(1:30, 100, replace = TRUE), key = "Month,Day")

# Dates that you want to exclude
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))

DT[-na.omit(DT[excl, which = TRUE])]

If your data contain at least one entry for each day you want to exclude, na.omit might not be required.

Community
  • 1
  • 1
BenBarnes
  • 19,114
  • 6
  • 56
  • 74
  • Thx, I can actually loop (omg, he is looping!!! yes!) over the `Join` up in my code, I was hoping for something more general/user friendly. – krhlk Oct 22 '12 at 19:15
  • 1
    @tomaskrehlik, It looks like a more general / user friendly method is on Matthew Dowle's to-do list, as he mentions in his comment in that answer I linked to. If looping over the join works for you, that's fine! – BenBarnes Oct 22 '12 at 19:21
  • ahh, I didn't read that comment! That was what I was looking for, so it does't seem to exist. Thanks! – krhlk Oct 22 '12 at 19:23