Rolling cross join with data.table using nested IDs and missing data

Question

I have a dataset of portfolio holdings:

# Input test data
portolios <- structure(list(portfolioid = c(1L, 1L, 1L, 1L, 1L, 1L), secid = c("A", "B", "A", "C", "C", "A"), reportdate = c("2010-03-31", "2010-03-31", "2010-06-30", "2010-06-30", "2010-07-15", "2010-08-31"), report_type = c("Full", "Full", "Full", "Full", "Partial", "Full"), shares = c(100L, 100L, 130L, 50L, 75L, 80L)), .Names = c("portfolioid", "secid", "reportdate", "report_type", "shares"), row.names = c(NA, -6L), class = c("data.table", "data.frame"))

 portfolioid secid reportdate report_type shares
1:           1     A 2010-03-31        Full    100
2:           1     B 2010-03-31        Full    100
3:           1     A 2010-06-30        Full    130
4:           1     C 2010-06-30        Full     50
5:           1     C 2010-07-15     Partial     75
6:           1     A 2010-08-31        Full     80

I need to impute the following missing records:

7:           1    B 2010-06-30       Full       0
8:           1    C 2010-08-31       Full       0

The business issue is that sales of positions (shares = 0) are sometimes not reported for Full report_type, so the missing SecID have to be imputed based on the prior report.

Ultimately I am seeking to calculate the change in shares for each SecID from the prior report for each portfolioID such that my dataset looks like this:

changes <- structure(list(portfolioid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), secid = c("A", "B", "A", "B", "C", "C", "A", "C"), reportdate = c("2010-03-31", "2010-03-31", "2010-06-30", "2010-06-30", "2010-06-30", "2010-07-15", "2010-08-31", "2010-08-31"), report_type = c("Full", "Full", "Full", "Full", "Full", "Partial", "Full", "Full"), shares = c(100L, 100L, 130L, 0L, 50L, 75L, 80L, 0L), change = c(100L, 100L, 30L, -100L, 50L, 25L, -50L, -75L)), .Names = c("portfolioid", "secid", "reportdate", "report_type", "shares", "change"), row.names = c(NA, -8L), class = c("data.table", "data.frame"))

   portfolioid secid reportdate report_type shares change
1:           1     A 2010-03-31        Full    100    100
2:           1     B 2010-03-31        Full    100    100
3:           1     A 2010-06-30        Full    130     30
4:           1     B 2010-06-30        Full      0   -100
5:           1     C 2010-06-30        Full     50     50
6:           1     C 2010-07-15     Partial     75     25
7:           1     A 2010-08-31        Full     80    -50
8:           1     C 2010-08-31        Full      0    -75

I am stuck with how to create the i for the outer join portfolios[i]. My issue is that I don't want to use i <- CJ(reportdate, secid) because it will produce too many records that are unnecessary since not every secid exists at every ReportDate and not correctly represent the data that needs filling.

I think I need a rolling cross-join between reportdate,reportdate[-1,secid]

I want to roll forward the secid and set shares := 0 when secid is missing in a Full report but it existed in prior report (either Partial or Full). I believe I would do this with option roll=1 but I am not sure where or how to implement.

I think my problem is similar to

How to Calculate a rolling statistic in R using data.table on unevenly spaced data

I am sure I am missing something basic in my understanding or a trick with CJ() that can create the requisite i

What are you want to join `portolios` to? To your older report? Where is it? You didn't provide it. Or you want to roll join to these specific two observations? It is very unclear (at least to me) what's going on. It would be easier if you'd just provide the data sets you have and then your desired output from the join — David Arenburg, Dec 14 '14 at 19:38
@davidarenburg the only data provided is the portfolios table. I need to create or derive from this a table to join to, to implement the logic I described. The desired output is also shown as the changes table I provided. — superquant, Dec 14 '14 at 20:10
I don't understand how do you know which are the missing records. You only specify `shares := 0` but how do you know the `secid` and the date missing? Why `1 B 2010-08-31 Full 0` is not missing too, for example? — David Arenburg, Dec 14 '14 at 20:18
@davidarenburg the missing records have to be imputed based on this rule: if a `secid` existed in a prior `reportdate` but is missing in the following `Full reportdate` then this `secid` should be carried forward and `shares:=0`. The trick is how to calculate which `secid` should be carried forward from the prior `reportdate`. This is why i believe `roll=1` may be part of the solution. — superquant, Dec 14 '14 at 20:29
@davidarenburg the reason `1 B 2010-08-31 Full 0` is not missing is because B is not in the prior report of `2010-06-30` in the input dataset. You could also look at it as only wanting to carry forward to the next period `secid` where `shares <> 0`. You do not want to carry forward filled missing values for `shares = 0` — superquant, Dec 14 '14 at 21:17

Mike.Gahan · Accepted Answer · 2014-12-15T13:03:28.733

1

Something like this should work (if I understand correctly)

First set reportdate with date class. Also get unique dates

portolios[, reportdate := as.IDate(reportdate)]
uniq.dts <- unique(portolios$reportdate)
uniq.dts <- uniq.dts[order(uniq.dts)]

Perform a self-join for each i and extract only dates occuring after we know the secid already exists (should be more memory efficient than a CJ)

setkey(portolios,secid)
setorder(portolios,sec,id,reportdate)

impute <- portolios[portolios, {
      tmp = max(reportdate) < uniq.dts;
        list(portfolioid=1,reportdate=uniq.dts[tmp][1],report_type="Full",shares=0)
},by=.EACHI][!is.na(reportdate)][,.SD[1],by=secid]

Next, rbindlist original table and the impute table.

portolios <- rbindlist(list(portolios,impute),fill=TRUE)

#Order data by secid and reportdate
portolios <- portolios[order(secid,reportdate)]

#Lag data by group
portolios[, prev.shares := c(NA,lag(shares)), by=secid]

#Calculate change WHEN a previous share amount exists
portolios[, change := ifelse(is.na(prev.shares),shares,shares-prev.shares), by=secid]

print(portolios[order(reportdate)])
   portfolioid secid reportdate report_type shares prev.shares change
1:           1     A 2010-03-31        Full    100          NA    100
2:           1     B 2010-03-31        Full    100          NA    100
3:           1     A 2010-06-30        Full    130         100     30
4:           1     B 2010-06-30        Full      0         100   -100
5:           1     C 2010-06-30        Full     50          NA     50
6:           1     C 2010-07-15     Partial     75          50     25
7:           1     A 2010-08-31        Full     80         130    -50
8:           1     C 2010-08-31        Full      0          75    -75

edited Dec 15 '14 at 13:03

answered Dec 14 '14 at 22:37

Mike.Gahan

4,565
23
39

thanks @mike.gahan this is close but how are you creating the rows `4: 1 B 2010-06-30 Full 0 100 -100` and `8: 1 C 2010-08-31 Full 0 75 -75` in your `portfolios` result? These rows are missing in the original `portfolios` structure and need to be imputed / created, setting shares := 0. Creating these missing rows is the part that I am stuck on. – superquant Dec 15 '14 at 01:59
So does `C 2010-08-31` need to be imputed as well? If not, why? – Mike.Gahan Dec 15 '14 at 02:22
yes .. there are 6 rows in my `portfolios` input data and 8 in the result I would like. If a `secid` was in a previous `reportdate` but is missing in a subsequent `Full` `reportdate` for the same `portfolio` then the missing `shares` is replaced with `0`. The challenge is to fill the missing `secid` as well. – superquant Dec 15 '14 at 02:26
I think I understand better now. Might be useful to do a self-join for each i, not a CJ. – Mike.Gahan Dec 15 '14 at 04:19
Does the routine iterate by `portfolioid` when creating `impute`? Trying to understand this line better `list(portfolioid=1,reportdate=uniq.dts[tmp][1],report_type="Full",shares=0) },by=.EACHI` – superquant Dec 15 '14 at 15:56
The function inside the `{` is acting on each `secid` (the key). Inside the `{`, it is comparing all the unique dates in the full dataset to the dates within each `secid`. If there are dates in the data that do not exist within the `secid`, it creates a record with the date and shares. I assumed the portfolioid was a constant 1. – Mike.Gahan Dec 15 '14 at 17:02
Thanks - the `portfolioid` is itself a variable as there are numerous portfolios and the idea is to fill the missing records and calculate changes for each one. So the list of unique `reportdates` should be grouped by `portfolioid` as opposed to the entire dataset .. then for each `portfolioid` do the imputation and calculation of `change` – superquant Dec 15 '14 at 21:07

Rolling cross join with data.table using nested IDs and missing data

1 Answers1