Remove a range in data.table

Question

I am trying to exclude some rows from a datatable based on, let's say, days and month - excluding for example summer holidays, that always begin for example 15th of June and end the 15th of next month. I can extract those days based on Date, but as as.Date function is awfully slow to operate with, I have separate integer columns for Month and Day and I want to do it using only them.

It is easy to select the given entries by

DT[Month==6][Day>=15]
DT[Month==7][Day<=15]

Is there any way how to make "difference" of the two data.tables (the original ones and the ones I selected). (Why not subset? Maybe I am missing something simple, but I don't want to exclude days like 10/6, 31/7.)

I am aware of a way to do it with join, but only day by day

setkey(DT, Month, Day)
DT[-DT[J(Month,Day), which= TRUE]]

Can anyone help how to solve it in more general way?

Good question. You'll likely be interested in the answers and discussion that resulted from the same question when it was asked here: http://stackoverflow.com/questions/12319083/how-do-i-do-a-negative-nomatch-inverse-search-in-data-table/12319774#12319774 — Josh O'Brien, Oct 22 '12 at 18:39
And there was [this answer](http://stackoverflow.com/a/13013289/1281189) just a little while ago — BenBarnes, Oct 22 '12 at 18:50
Hmm, yes, that is interesting, but for me the problem is, that I cannot do negative match in this way, because I try to do subset based on two columns... :-/ The problem I need to do complement to intersection... — krhlk, Oct 22 '12 at 18:51

Matt Dowle · Accepted Answer · 2012-10-22T21:20:13.090

3

Great question. I've edited the question title to match the question.

A simple approach avoiding as.Date which reads nicely :

DT[!(Month*100L+Day) %between% c(0615L,0715L)]

That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear :

DT[,mmdd:=Month*100L+Day]
from = DT[J(0615),mult="first",which=TRUE]
to = DT[J(0715),mult="first",which=TRUE]
DT[-(from:to)]

That's a bit long and error prone because it's DIY. So one idea is that a list column in an i table would represent a range query (FR#203, like a binary search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list column range query to do exactly what you asked :

setkey(DT,mmdd)
DT[-J(list(0615,0715))]

That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.

edited Oct 22 '12 at 21:20

answered Oct 22 '12 at 21:08

Matt Dowle

58,872
22
166
224

And what about "negative select"? I mean something that would have syntax something like `!DT[Month==6][Day>=15]` and would produce the original dataset without the selected entries? – krhlk Oct 24 '12 at 10:49
@tomaskrehlik Huh? That's what my answer answers, doesn't it? Are you defining a new term _negative-select_ different to _not-join_? Another way is just : `DT[!(Month==6 & Day>=15)]`. – Matt Dowle Oct 24 '12 at 10:56
yes, sorry, the answer gets the job done, of course. The comment was a mere suggestion for feature that I think would read nicely and be extremely easy to understand. – krhlk Oct 24 '12 at 11:05
@tomaskrehlik Is your suggestion syntax like `!DT[Month==6][Day>=15]`? This can't work because `!` of a dataset makes no sense. The compound `[Month==6][Day>=15]` is missing the raison d'etre for everything being inside one `[...]` if possible. No `j` in the first `[Month==6]` means all columns will be subset, to be passed to the second `[Day>=15]`. `data.table` wants you to combine things together and stick the _select_ query (or _negative-select_) as `i` inside `[...]`, not outside. Then it can make optimizations under the hood for you. – Matt Dowle Oct 24 '12 at 11:22

score 1 · Answer 2 · edited May 23 '17 at 10:24

1

Based on the answer here, you might try something like

# Sample data
DT <- data.table(Month = sample(c(1,3:12), 100, replace = TRUE),
  Day = sample(1:30, 100, replace = TRUE), key = "Month,Day")

# Dates that you want to exclude
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))

DT[-na.omit(DT[excl, which = TRUE])]

If your data contain at least one entry for each day you want to exclude, na.omit might not be required.

edited May 23 '17 at 10:24

Community

1
1

answered Oct 22 '12 at 19:10

BenBarnes

19,114
6
56
74

Thx, I can actually loop (omg, he is looping!!! yes!) over the `Join` up in my code, I was hoping for something more general/user friendly. – krhlk Oct 22 '12 at 19:15
1

@tomaskrehlik, It looks like a more general / user friendly method is on Matthew Dowle's to-do list, as he mentions in his comment in that answer I linked to. If looping over the join works for you, that's fine! – BenBarnes Oct 22 '12 at 19:21
ahh, I didn't read that comment! That was what I was looking for, so it does't seem to exist. Thanks! – krhlk Oct 22 '12 at 19:23

Remove a range in data.table

2 Answers2

Linked