0

I have a data.table, and a list of dates. I wish to filter and amend the rows using a function that checks to see if the dates against the list.

# example data
set.seed(1)
tt <- sample(
             seq(as.POSIXct("2011-10-02"), as.POSIXct("2014-04-06"),
                 by = "day"), 10)
IR1 <- data.table(tstamp = sort(tt), dLoc = 1L:10L)

List of dates:

DLSlist <- lapply(
                  list(dls11t12 = c("2011-10-02", "2012-04-01"), 
                       dls12t13 = c("2012-10-07", "2013-04-07"), 
                       dls13t14 = c("2013-10-06", "2014-04-06"),
                       dls14t15 = c("2014-10-05", "2015-04-05"),
                       dls15t16 = c("2015-10-04", "2016-04-03"),
                       dls16t17 = c("2016-10-02", "2017-04-02")
                       ), 
                  function(X) as.POSIXct(X)
                  )

I would like to transform dLoc if it falls inside any of the date ranges in DLSlist. I can do it the long way as follows:

IR1[tstamp > DLSlist[[1]][1] & tstamp < DLSlist[[1]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[2]][1] & tstamp < DLSlist[[2]][2], tstamp := tstamp + 60*60]
IR1[tstamp > DLSlist[[3]][1] & tstamp < DLSlist[[3]][2], tstamp := tstamp + 60*60] 

However that seems error-prone: a function is suited to this task ... mine didn't work.

DLStest <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

I applied it with:

IR1[DLStest(tstamp, DLSlist), tstamp := tstamp + 60*60]

However it didn't work: all of the rows were transformed (not only the ones inside the ranges, as had been the case in my ugly hack code).

Is there some means of selecting rows using a function -- or some other means of selecting rows based upon multiple range checks?


Update (with thanks to Frank, who spotted the issue)

You can indeed filter with a function that returns a vector or booleans. The error was all with my initial function.

DLStest_old <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

sapply returns an object who's class is matrix; any checks to see if there are any true values in the entire matrix. If there are any true values is evaluates to a single TRUE. If not, it evaluates to a single FALSE.

Using the test data:

(IR1[DLStest_old(tstamp, DLSlist), dLoc := dLoc + 1000L])

                 tstamp dLoc
 1: 2011-11-27 01:00:00 1001
 2: 2012-04-03 00:00:00 1002
 3: 2012-06-01 00:00:00 1003
 4: 2012-09-06 00:00:00 1004
 5: 2013-03-09 01:00:00 1005
 6: 2013-04-25 00:00:00 1006
 7: 2013-05-25 00:00:00 1007
 8: 2013-12-29 01:00:00 1008
 9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010

The fix is to test separately for each row of the matrix, using apply.

DLStest <- function(dd, DLSobj) {
    apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}

This now works:

> (IR1[DLStest(tstamp, DLSlist), dLoc := dLoc + 1000L])
                 tstamp dLoc
 1: 2011-11-27 01:00:00 1001
 2: 2012-04-03 00:00:00    2
 3: 2012-06-01 00:00:00    3
 4: 2012-09-06 00:00:00    4
 5: 2013-03-09 01:00:00 1005
 6: 2013-04-25 00:00:00    6
 7: 2013-05-25 00:00:00    7
 8: 2013-12-29 01:00:00 1008
 9: 2014-01-09 01:00:00 1009
10: 2014-02-08 01:00:00 1010
ricardo
  • 8,195
  • 7
  • 47
  • 69
  • +1 Does `DLStest(IR1$tstamp,DLSlist)` give the expected result? I think it should only give a single true or false value for the entire tstamp vector as-is. Maybe if you add `by=1:nrow(IR1)` it will work... – Frank Oct 28 '13 at 08:05
  • 2
    [Please see this link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Giving proper reproducible example will likely get better and faster answers. – CHP Oct 28 '13 at 08:11
  • @Frank: you are right, it gave only a single `logi` value back. Unfortunately `IR1[DLStest(tstamp, DLSlist), tstamp := tstamp + 60*60, by = 1:nrow(IR1)]` did not work. – ricardo Oct 28 '13 at 08:26
  • @geektrader -- sorry mate. fixed now. – ricardo Oct 28 '13 at 18:10
  • @Frank, you should make an answer so i can accept it. You were right, the issue was with the original function returning only a single value. – ricardo Oct 28 '13 at 18:38
  • Ok, done. It's better to post your answer as a separate answer than to edit it into the question, as explained here: http://meta.stackexchange.com/a/118459/209360 Anyway, I didn't think of that until after stealing your answer and merging it into mine. Cheers :) – Frank Oct 28 '13 at 20:02

2 Answers2

1

Your data looks like it doesn't have overlapping ranges in DLSlist, in which case this should work -

library(data.table)

#creating the data
DLSlist <- data.table(read.csv(textConnection('
                  "2011-10-02", "2012-04-01" 
                  "2012-10-07", "2013-04-07" 
                  "2013-10-06", "2014-04-06"
                  "2014-10-05", "2015-04-05"
                  "2015-10-04", "2016-04-03"
                  "2016-10-02", "2017-04-02"'), header = FALSE))

IR1 <- data.table(
   tstamp = c("2011-10-01", "2012-10-06", "2014-10-07","2016-10-03")
)

#fixing data type       
IR1[,tstamp := as.Date(tstamp,"%Y-%m-%d")]
DLSlist[,V1 := as.Date(V1,"%Y-%m-%d")]
DLSlist[,V2 := as.Date(V2,"%Y-%m-%d")]
DLSlist[,tstamp := V1]

#setting a key for data.table to find the closest match
setkey(IR1,tstamp)
setkey(DLSlist,tstamp)

#roll = Inf finds the closest match for the key
IR2 <- DLSlist[IR1, roll = Inf]

#Doing the operation where condition is satisfied
IR2[tstamp > V1 & tstamp < V2 , tstamp2 := tstamp + 60*60]

Output

> IR2
       tstamp         V1         V2    tstamp2
1: 2011-10-01       <NA>       <NA>       <NA>
2: 2012-10-06 2011-10-02 2012-04-01       <NA>
3: 2014-10-07 2014-10-05 2015-04-05 2024-08-15
4: 2016-10-03 2016-10-02 2017-04-02 2026-08-12

If you do have overlapping ranges then you create something like a set of all dates on which to perform this operation, and merge it back to IR1 to see which dates fall in this set. You can get a list of all the dates on which to perform this operation like this -

DLSlist2 <- unique(DLSlist[,list(DatesToFix = seq.Date(V1, V2, by = "day")), by = "V1"][,V1 := NULL])

I trust that you will be able to put this logic as a function.

TheComeOnMan
  • 12,535
  • 8
  • 39
  • 54
1

You want to subset with a logical vector. In your initial formulation, the function only returns a single value (instead of a vector), causing your assignment to affect all or none of the rows.

IR <- copy(IR1)
DLStest_old <- function(dd, DLSobj) {
    any(sapply(DLSobj, function(X) dd %between% X))
}

# on the whole tstamp vector at once
  IR[,DLStest_old(tstamp, DLSlist)]
  # TRUE

One solution is to use your function, but apply it "by row":

# by row
  IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
  # TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

Note that I'm putting this in the j position of the data.table to return the result. Typically, to subset by an expression it can be put into the i position (before the first comma), however "by" does not apply to i expressions, so for this approach it's probably best to save the logical vector and then subset by it:

 # by row, for use in i
    change_em <- IR[,DLStest_old(tstamp, DLSlist),by=1:nrow(IR)]$V1
    IR[change_em,tstamp:=tstamp+1e15][]

I busted your dates to make the changes more clear, resulting in:

                 tstamp dLoc
 1: ))0'-06-03 15:45:52    1
 2: 2012-04-03 00:00:00    2
 3: 2012-06-01 00:00:00    3
 4: 2012-09-07 00:00:00    4
 5: ))0'-06-03 15:45:52    5
 6: 2013-04-26 00:00:00    6
 7: 2013-05-25 00:00:00    7
 8: ))0'-06-03 15:45:52    8
 9: ))0'-06-03 15:45:52    9
10: ))0'-06-03 15:45:52   10

Another solution that you found is to use something from the *apply family:

DLStest_apply <- function(dd, DLSobj) {
    apply(sapply(DLSobj, function(X) dd %between% X), 1, any)
}

# apply "any" on the margin of the sapply result
  IR[,DLStest_apply(tstamp, DLSlist)]
  # TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

apply is made for matrices and arrays and the result of this sapply is a matrix,

class(sapply(DLSlist, function(X) IR$tstamp %between% X))
# "matrix"

so this should be pretty fast. In general, sapply can return different types of results.


P.S. I think dates are hard to read at a glance and it's best not to use them in your examples if you can tell ahead of time that you don't need them.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • when i tried assignment with the function applied by row it failed -- see my comment to the original question. Could you please provide an example? – ricardo Oct 28 '13 at 20:55
  • @ricardo Ah, you're right, you can't just put that in the i slot. I've edited the answer. Your solution, with apply, is the better way to go, I think. – Frank Oct 28 '13 at 21:56