I would like to see what people use to work with panel data in R on large datasets (i.e. 50 million+ observations). The `data.table` package is useful in that it has keys and is very fast. The `xts` package is useful because it has facilities for all sorts of time series operations. So it seems there are two good options:
- have a `data.table` and write custom time series functions to work on it, or
- have a list of `xts` objects and run `lapply` on that list every time you want to do something; eventually this will need to be merged into a `data.frame` to do regressions etc. (a sketch of that merge step follows this list).
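For the second option, the merge-back step might look roughly like this. A minimal sketch, assuming `y` is a named list with one single-column `xts` object per firm and that the list names carry the firm ids (both of which are assumptions, not something established above):

```r
library(xts)

# y is assumed to be a named list of xts objects, one per firm id
panel_df <- do.call(rbind, lapply(names(y), function(i) {
  obj <- y[[i]]
  data.frame(id = i,
             dte = index(obj),   # recover the time index as a column
             coredata(obj),      # the series values
             row.names = NULL)
}))
# panel_df is a stacked data.frame, ready for lm() / plm() etc.
```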
I am aware of the `plm` package but have not found it as useful for data management as the two options above. What do you guys use? Any ideas on what works best when?
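For reference, the `plm` equivalent of the lagging task below would be roughly the following. A minimal sketch, assuming a plain `data.frame` with `id`, `dte`, and `val` columns (hypothetical names chosen to match the example that follows):

```r
library(plm)

# df: hypothetical panel data.frame with firm id, date, and value
df <- data.frame(id  = rep(1:10, times = 10),
                 dte = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 10), each = 10),
                 val = 1:100)

p <- pdata.frame(df, index = c("id", "dte"))
p$lag_val <- lag(p$val)  # lag() on a pseries shifts within each id, not across firms
```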
Let me propose a scenario: imagine having N firms with T time periods, where N >> 0 and T >> 0. `data.table` will be super fast if I want to lag each firm's series by one time period, for example:
```r
library(data.table)

# 10 firms x 10 days; keying on (id, dte) sorts rows by date within each firm
x <- data.table(id = 1:10,
                dte = rep(seq(from = as.Date("2012-01-01"), to = as.Date("2012-01-10"), by = "day"), each = 10),
                val = 1:100,
                key = c("id", "dte"))
x[, lag_val := c(NA, head(val, -1)), by = id]  # one-period lag within each firm
```
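As an aside, recent versions of `data.table` (1.9.6 and later) provide `shift()`, which expresses the same lag more directly and handles the `NA` padding for you:

```r
x[, lag_val := shift(val, 1L, type = "lag"), by = id]  # equivalent to the head() idiom above
```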
Another way to do this might be:

```r
library(xts)

ids <- unique(x$id)  # the original snippet assumed ids was already defined
y <- lapply(ids, function(i) xts(x[id == i, val], order.by = x[id == i, dte]))
# note: lag() in xts returns the previous period's value for k = 1,
# the opposite sign convention from lag.zoo / base lag()
y <- lapply(y, function(obj) cbind(obj, lag(obj, 1)))
```
The advantage of the former is its speed with big data. The advantage of the latter is the ability to do things like `period.apply` and use the other functionality of `xts`. Are there tricks to making the `xts` representation faster? Maybe a combination of the two? Converting to and from `xts` objects seems costly.
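To make the "combination" idea concrete, here is a minimal sketch: keep the master copy in `data.table` and build an `xts` object per firm only inside the grouped call, so `period.apply` is available without maintaining a second copy of the data. The monthly mean is just a placeholder aggregate, and the output column names are hypothetical:

```r
# hybrid: data.table as the store, xts only transiently per group
x[, {
  obj <- xts(val, order.by = dte)                       # one conversion per firm
  m <- period.apply(obj, endpoints(obj, on = "months"), FUN = mean)
  .(period_end = index(m), mean_val = as.numeric(m))    # back to plain columns
}, by = id]
```

Whether this beats keeping a persistent list of `xts` objects depends on how often the conversion is repeated; if the same series gets converted inside every call, building the list once and `lapply`-ing over it should win.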