3

data.table implements asof (also know as rolling or LOCF) joins out of the box. I've found this related question :

Filling in missing (blanks) in a data table, per category - backwards and forwards

but that question has NAs in the data. In my case I'm following the advice there to keep the data irregular and join to it using roll=TRUE. What I'd like to do instead of the last observation carried forward, is the next observation to be carried backward, as efficiently as possible.

This is what I've tried, using time:=-time first to try and trick it. Can I do it better? Can I do it faster?

llorJoin <- function(A,B){
    B <- copy(B);
    keys <- key(A);
    if( !identical(key(A), key(B)) | is.null(keys) ){
       stop("llorJoin::ERROR; A and B should have the same non-empty keys");
    }

    lastKey <- tail(keys,1L);
    myStr <- parse(text=paste0(lastKey,":=-as.numeric(",lastKey,")"));
    A <- A[,eval(myStr)]; setkeyv(A,keys);
    B <- B[,eval(myStr)]; setkeyv(B,keys);

    origin <- "1970-01-01 00:00.00 UTC";
    A <- B[A,roll=T];
    myStr2 <- parse(text=paste0(lastKey,":=as.POSIXct(-",lastKey,",origin=origin)"));
    A <- A[,eval(myStr2)]; setkeyv(A,keys);
    return(A);
}

library(data.table)
A <- data.table(time=as.POSIXct(c("10:01:01","10:01:02","10:01:04","10:01:05","10:01:02","10:01:01","10:01:01"),format="%H:%M:%S"),
                b=c("a","a","a","a","b","c","c"),
                d=c(1,1.9,2,1.8,5,4.1,4.2));
B <- data.table(time=as.POSIXct(c("10:01:01","10:01:03","10:01:00","10:01:01"),format="%H:%M:%S"),b=c("a","a","c","d"), e=c(1L,2L,3L,4L));
setkey(A,b,time)
setkey(B,b,time)

library(rbenchmark)
benchmark(llorJoin(A,B),B[A,roll=T],replications=10)
            test replications elapsed relative user.self sys.self user.child sys.child
1 llorJoin(A, B)           10   0.045        1     0.048        0          0         0
2 B[A, roll = T]           10   0.009        1     0.008        0          0         0

   b                time  e   d
1: a 2013-01-12 09:01:01  1 1.0
2: a 2013-01-12 09:01:02  2 1.9
3: a 2013-01-12 09:01:04 NA 2.0
4: a 2013-01-12 09:01:05 NA 1.8
5: b 2013-01-12 09:01:02 NA 5.0
6: c 2013-01-12 09:01:01 NA 4.1
7: c 2013-01-12 09:01:01 NA 4.2

So as a comparaison, asof join on the initial data is 5X faster.

Community
  • 1
  • 1
statquant
  • 13,672
  • 21
  • 91
  • 162
  • 1
    What specifically is your question? You say you want a last observation carried backward instead of `locf` but you already have an implementation. Does it work? Is it too slow? Are you looking for something more idiomatic? I know nothing about `data.table` package so I cannot help anyway, but with your question in its current form, those potential answerers who **do** use `data.table` are likely as confused as I am. – SlowLearner Jan 12 '13 at 13:52
  • The problem is that to do `B[A,roll=T]`, `A` and `B` have to be sorted. Then reversing `time:=-time` breaks the sort. I tried to gain time in removing the `setkey` sortings afterwards doing the sort "myself" by `A[,time:=time[.N:.1],by="b"]` and using `setattr(A,c("b","time"))`, but it seems longer... – statquant Jan 12 '13 at 14:21
  • 2
    @statquant The point of StackOverflow is to ask clear questions that generate useful answers, so that others can benefit. I have looked through your questions and you do have a tendency to be verbose; this is not a problem of English but a problem of mental organisation. [This one](http://stackoverflow.com/questions/6585820/r-merge-data-frames-asof-join) ended up as a decent question and [this one](http://stackoverflow.com/questions/14076065/data-table-inner-outer-join-with-na-in-join-column-of-type-double-bug) found a bug. So it's useful stuff, but please remember that concision is important! – SlowLearner Jan 12 '13 at 15:04

1 Answers1

4

roll argument can perform nocb since a long time ago. Updating this answer so that #615 can be closed.

You don't need to set keys anymore as well. Instead you can specify the columns to join on using on= argument (implemented in v1.9.6). With these two features, the task can be accomplished as follows:

require(data.table) # v1.9.6+
A[B, on=c("b", "time"), roll=-Inf]
#                   time b  e   d
# 1: 2015-10-11 10:01:01 a  1 1.0
# 2: 2015-10-11 10:01:02 a  2 1.9
# 3: 2015-10-11 10:01:04 a NA 2.0
# 4: 2015-10-11 10:01:05 a NA 1.8
# 5: 2015-10-11 10:01:02 b NA 5.0
# 6: 2015-10-11 10:01:01 c NA 4.1
# 7: 2015-10-11 10:01:01 c NA 4.2

That's it.


You're pretty close to the fastest way without a change to data.table. The following feature request has been filed some time ago :

FR#2300 Add backwards and firstback to roll=TRUE

I've added a link there back to this question. You can search the feature request list on R-Forge. In this case words like "roll", "forwards" and "backwards" all find it. You might need 4 or 5 attempts search attempts to confirm the bug or feature request isn't already filed.

It's probably quicker for me to implement that feature request (only a few lines internally are needed) than try and provide you the quickest possible workaround.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Thanks Matthew, are you upgrading the priority of the feature request or leave it as it is? Can I suggest to change `roll=T` to `roll={"locf","nocb"}` (next observation carried backward) – statquant Jan 14 '13 at 09:47
  • There was an idea somewhere for `roll=-30` (next observation provided not more than diff of 30, likely days), `roll=+30` (locf provided it was within 30 days ago). So the current `TRUE` would correspond to `+Inf`. and `FALSE` to `0`. `-Inf` would be nocb. What you think? – Matt Dowle Jan 14 '13 at 09:54
  • I think it's brilliant, it's adding the additional feature of "window" joining ! – statquant Jan 14 '13 at 10:23