
I am struggling to efficiently perform a "close" date match between two data frames. This question explores a solution using idata.frame from the plyr package, but I would be very happy with other suggested solutions as well.

Here is a very simplistic version of the two data frames:

sampleticker<-data.frame(ticker=c("A","A","AA","AA"),
  date=c("2005-1-25","2005-03-30","2005-02-15","2005-04-21"))
sampleticker$date<-as.Date(sampleticker$date,format="%Y-%m-%d")

samplereport<-data.frame(ticker=c("A","A","A","AA","AA","AA"),
  rdate=c("2005-2-15","2005-03-15","2005-04-15",
  "2005-03-01","2005-04-20","2005-05-01"))
samplereport$rdate<-as.Date(samplereport$rdate,format="%Y-%m-%d")

In the actual data, sampleticker has over 30,000 rows and 40 columns, and samplereport has almost 300,000 rows and 25 columns.

What I would like to do is to merge the two data frames so that each row in sampleticker is combined with the closest date match in samplereport which occurs AFTER the date in sampleticker. I have solved similar problems in the past by doing a simple merge on the ticker field, sorting ascending, and then selecting unique combinations of ticker and date. However, due to the size of this dataset, the merge blows up extremely quickly.
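
For reference, on smaller data the approach described above looks roughly like the sketch below (illustrative only; the intermediate merge on ticker is exactly the step that blows up on the full data):

# Sketch of the merge-then-dedupe approach: the intermediate merge produces
# every (ticker, date, rdate) combination within a ticker, which is what
# blows up on the full data.
m <- merge(sampleticker, samplereport, by="ticker")   # all pairs within each ticker
m <- m[m$rdate > m$date, ]                            # keep only reports after the ticker date
m <- m[order(m$ticker, m$date, m$rdate), ]            # closest rdate sorts first
m[!duplicated(m[, c("ticker","date")]), ]             # first (closest) match per ticker/date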

As near as I can tell, merge does not allow this sort of approximate matching. I have seen some solutions which use findInterval, but since the distance between the dates will vary, I am not sure that I can specify an interval that will work for all rows.
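
The findInterval solutions I have seen would, as far as I can tell, amount to something like the sketch below for a single ticker (untested, and it still leaves me looping over every ticker):

# Untested findInterval sketch for one ticker: findInterval() returns the index
# of the last rdate that is <= date, so adding 1 points at the first rdate
# strictly after it (NA if there is none).
repA  <- samplereport[samplereport$ticker == "A", ]
repA  <- repA[order(repA$rdate), ]
tickA <- sampleticker[sampleticker$ticker == "A", ]
idx   <- findInterval(tickA$date, repA$rdate) + 1
cbind(tickA, rdate = repA$rdate[idx])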

Following another post here, I have written the code below, which uses adply to perform the subset-and-join for each row of sampleticker:

library(plyr)
merge<-adply(sampleticker,1,function(x){
  y<-subset(samplereport,ticker %in% x$ticker & rdate > x$date)
  y[which.min(y$rdate),]
  })

This works quite nicely: for the sample data, I get the output below, which is exactly what I want.

   date       ticker      rdate
 1 2005-01-25  A          2005-02-15
 2 2005-03-30  A          2005-04-15
 3 2005-02-15  AA         2005-03-01
 4 2005-04-21  AA         2005-05-01

However, since the code performs 30,000+ subsetting operations, it is extremely slow: I ran the above query for more than a day before finally killing it.

I see here that plyr 1.0 has a structure, idata.frame, which references the data frame instead of copying it, dramatically speeding up the subsetting operation. However, I cannot get the following code to work:

isamplereport<-idata.frame(samplereport)
adply(sampleticker,1,function(x){
  y<-subset(isamplereport,isamplereport$ticker %in% x$ticker & 
    isamplereport$rdate > x$date)
  y[which.min(y$rdate),]
})

I get the following error:

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
Results must be all atomic, or all data frames

This makes sense to me, since the operation returns an idata.frame (I assume). However, changing the last line to:

as.data.frame(y[which.min(y$rdate),]) 

also throws an error:

Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) : 
undefined columns selected.

Note that calling as.data.frame on the plain old samplereport returns the original data frame, as expected.

I know that idata.frame is experimental, so I didn't necessarily expect it to work properly. However, if anyone has an idea on how to fix this, I would appreciate it. Alternately, if anyone could suggest a completely different approach that runs more efficiently, that would be fantastic.

Matt

UPDATE: data.table is the right way to go about this. See the answers below.


3 Answers


Thanks to Matthew Dowle and his addition of the ability to roll backwards as well as forwards in data.table, it is now much simpler to perform this merge.

library(data.table)

ST <- data.table(sampleticker)
SR <- data.table(samplereport)
setkey(ST,ticker,date)
SR[,mergerdate:=rdate]        # copy rdate so it survives the rolling join
setkey(SR,ticker,mergerdate)
merge<-SR[ST,roll=-Inf]       # roll backwards: take the first rdate on or after each date
setnames(merge,"mergerdate","date")
merge

#    ticker       date      rdate
# 1:      A 2005-01-25 2005-02-15
# 2:      A 2005-03-30 2005-04-15
# 3:     AA 2005-02-15 2005-03-01
# 4:     AA 2005-04-21 2005-05-01
– Matt

Here is a data.table-based solution that's likely to work better than what you are currently using:

library(data.table)
ST <- data.table(sampleticker, key="ticker")
SR <- data.table(samplereport, key="ticker")
SR <- SR[with(SR, order(ticker, rdate)),] # rdates need to be in increasing order

SR[ST, list(date = date,
            rdate = rdate[match(TRUE, (rdate > date))]), ]
     ticker       date      rdate
[1,]      A 2005-01-25 2005-02-15
[2,]      A 2005-03-30 2005-04-15
[3,]     AA 2005-02-15 2005-03-01
[4,]     AA 2005-04-21 2005-05-01

Of course, it sounds like what you really want to do is to merge together two much wider data.frames. To demonstrate one way of accomplishing that, in the example below, I add some columns to both data.tables, and then show how you could merge the appropriate rows:

# Add some columns to both data.tables
ST$alpha <- letters[seq_len(nrow(ST))]
SR$n     <- seq_len(nrow(SR))
SR$ALPHA <- LETTERS[seq_len(nrow(SR))]

# Perform a merge that includes the whole rows from samplereport
# corresponding to the selected rdate
RES <- SR[ST, cbind(date, .SD[match(TRUE,(rdate>date)),-1]), ]

# Merge res (containing the selected rows from samplereport) back together
# with sampleticker
keycols <- c("ticker", "date")
setkeyv(RES, keycols)
setkeyv(ST, keycols)
ST[RES]
#      ticker       date alpha      rdate n ALPHA
# [1,]      A 2005-01-25     a 2005-02-15 1     A
# [2,]      A 2005-03-30     b 2005-04-15 3     C
# [3,]     AA 2005-02-15     c 2005-03-01 4     D
# [4,]     AA 2005-04-21     d 2005-05-01 6     F
– Josh O'Brien
  • `roll=TRUE` is designed for this really. To get the nearest _after_ then perhaps `X[ X[Y,roll=TRUE,which=TRUE]+1 ]`, or reverse it and do `Y[X,roll=TRUE]`. – Matt Dowle Feb 14 '12 at 10:08
  • But the next _after_ requirement is very rare. In practice `mult="last"` or `DT[J(date,23:00),roll=TRUE]` is usually better. – Matt Dowle Feb 14 '12 at 10:34
  • I will actually be using the "next after" requirement a great deal; it is not uncommon to select the first record after an event to gauge the response to the event. I will play around with using `roll=TRUE` to achieve the right effect. My preliminary (probably incorrect) attempts have been giving me the opposite: `samplereport` which is nearest BEFORE `sampleticker`, but reversing the procedure gives me all of the records in `samplereport` with a bunch of NAs, which is definitely not what I want. I will have to learn more about data.table, since it looks very useful. – Matt Feb 14 '12 at 16:53
  • @Matt -- That clarification helps. Since you are also looking at `roll=TRUE`, I will take the time to post a solution I figured out that uses it. Please let me know if it works on your data. – Josh O'Brien Feb 14 '12 at 16:57
  • Thanks Josh. I think that your solution above also works; I'm pretty sure that I had it working until I started playing around with `roll=TRUE`, and then I screwed it up. I am going to try to restore your solution now. I would definitely be interested in a `roll=TRUE` solution as well. It is more concise, and seems to be significantly faster. – Matt Feb 14 '12 at 17:19
  • @Matt -- Good. The other answer is done now. Not sure if my explanation of it is successful, and it's certainly not succinct. If you do get both methods set up, do let us know whether there's a significant performance difference between them. Cheers. – Josh O'Brien Feb 14 '12 at 17:49
  • I accepted this answer because it seemed like more of a "best practice" than the other solution: it is more intuitive. It also produces the prettiest output without any tweaking. For people who want to use this solution, be advised that, for my data, I was required to `setkeyv(SR,c("ticker","rdate"))` before performing the `RES` merge. – Matt Feb 14 '12 at 21:41
  • There is, however, a very large performance difference between the two solutions. Using my data, this solution took about 200 seconds. This is a massive improvement over what I had before, so thanks very much for helping me figure this out. However, the `roll=TRUE` solution takes 1.16 seconds, which certainly validates the approach. – Matt Feb 14 '12 at 21:43
  • Matt -- Thanks for providing those timings. That's very useful info. Also, @MatthewDowle (if you care to comment) does the `roll=TRUE` solution I came up with below look about optimal, or is there some simpler incantation that I'm overlooking? Thanks to both of you. – Josh O'Brien Feb 15 '12 at 21:18
  • @Matt Quick minor comment that `setkey(SR,ticker,rdate)` is easier (8 characters less typing) than `setkeyv(SR,c("ticker","rdate"))`. `setkeyv` is just intended for when you have to pass in a key as a variable programmatically really. – Matt Dowle Feb 16 '12 at 10:32

Here's a solution that follows up on Matthew Dowle's observation that this is a natural place to apply data.table's roll=TRUE argument.

If you are to apply it, there is one wrinkle to be ironed out. roll=TRUE is designed such that when an exact match isn't found for the last column of the key (here date), the value from the nearest previous date will be rolled forward. You, though, want the opposite (and even when there is an exact match, you still want the value from the next available date).

A first attempt might be to sort by "ticker", and then by "rdate" in reverse order, merging with the resulting reordered SR. That would work, except that data.table doesn't let you sort in reverse order: keying by "rdate" forces that column into ascending order. (data.table needs to do that in order to implement the speedy matching and joining for which it was designed).

My solution below is to create a new column -- "rnd", for "reverse numerical date" -- in both data.tables, whose values are formed by taking -as.numeric(date). This assigns a unique value to each date. Moreover, because the values have been multiplied by -1, sorting them in ascending order has the effect of sorting the dates in descending order.

(One other detail: because you don't want exact matches, and instead always want the next date after the current one, I've subtracted 1 from sampleticker's rnd, which has the desired effect. To confirm that it's doing its job correctly, I slightly edited your example data to include one possible exact match ("2005-1-25") which should not be selected by the merge).

# Create sample data.tables
library(data.table)

ST <- data.table(ticker = c("A","A","AA","AA"),
                 date = as.Date(c("2005-1-25","2005-03-30","2005-02-15",
                                  "2005-04-21"), format="%Y-%m-%d"),
                 alpha = letters[1:4])    

SR <- data.table(ticker = c("A","A","A","AA","AA","AA"),
                 rdate = as.Date(c("2005-1-25","2005-03-15","2005-04-15",
                                   "2005-03-01","2005-04-20","2005-05-01"), 
                                   format="%Y-%m-%d"),
                 ALPHA = LETTERS[1:6])

With sample data in hand, set up for and perform the desired merge:

# Create a "reverse numerical date" column, which will uniquely
# identify date, and allow them to be sorted in reverse temporal order
ST$rnd <- -(as.numeric(ST$date) + 1)
SR$rnd <- -(as.numeric(SR$rdate))

# key (and thus sort) both data.tables by ticker and "reverse numerical date"
keycols <- c("ticker", "rnd")
setkeyv(ST, keycols)
setkeyv(SR, keycols)

# The syntax of the merge is now as simple as can be
res <- SR[ST, roll=TRUE]

# Finally, put the results back in temporal order, and pretty up the column order
setkeyv(res, c("ticker", "date"))
setcolorder(res, c("ticker", "date", "rdate", "alpha", "ALPHA", "rnd"))
res
#      ticker       date      rdate alpha ALPHA    rnd
# [1,]      A 2005-01-25 2005-03-15     a     B -12809
# [2,]      A 2005-03-30 2005-04-15     b     C -12873
# [3,]     AA 2005-02-15 2005-03-01     c     D -12830
# [4,]     AA 2005-04-21 2005-05-01     d     F -12895
– Josh O'Brien
  • This is a great solution as well, and it operates much more quickly than the accepted solution. – Matt Feb 14 '12 at 21:44
  • Very nice! I didn't fully check it but I get the concept. Ok, you guys convinced me: `[.data.table` needs a new argument to roll the next observation back, doesn't it. That's an easy switch internally, btw. Options: `revroll`, `rollback`, `rollbacktofirst`, `next`,`after` or some combination? Or rather than new argument(s), `roll=-1|0|1` would mean `after|equal|previous` where `TRUE` and `FALSE` would need no change since they map to previous and equal respectively. – Matt Dowle Feb 16 '12 at 10:49
  • Sorry for the late response. It would be great if you could add this additional argument! For my current purposes, adding an option to `roll=-1` would be sufficient. However, I can see where you might need `rolltofirst` or something like it, so it might make sense to add whatever functionality is necessary to duplicate the existing `roll` functions in the opposite direction. Thanks for the consideration! – Matt Mar 27 '12 at 15:06