
I would like to see what people use to work with panel data in R on large datasets (i.e. 50 million+ observations). The data.table package is useful because it has keys and is very fast; the xts package is useful because it has facilities for all sorts of time-series operations. So there seem to be two good options:

  1. keep a data.table and write custom time-series functions to work on it
  2. keep a list of xts objects and run lapply over that list every time you want to do something; eventually the results need to be merged back into a data.frame for regressions etc.

I am aware of the plm package, but I have not found it as useful for data management as the two options above. What do you use? Any ideas on what works best when?

Let me propose a scenario: imagine N firms with T time periods, where N >> 0 and T >> 0. data.table will be very fast if I want to lag each firm's value by one time period, for example:

library(data.table)

x <- data.table(id  = 1:10,
                dte = rep(seq(from = as.Date("2012-01-01"),
                              to   = as.Date("2012-01-10"), by = "day"),
                          each = 10),
                val = 1:100,
                key = c("id", "dte"))
x[, lag_val := c(NA, head(val, -1)), by = id]

Another way to do this might be:

library(xts)

ids <- unique(x$id)
y <- lapply(ids, function(i) xts(x[id == i, val], order.by = x[id == i, dte]))
y <- lapply(y, function(obj) cbind(obj, lag(obj, 1)))

The advantage of the former is its speed on big data. The advantage of the latter is being able to use period.apply and the rest of xts's functionality. Are there tricks to making the xts representation faster? Maybe a combination of the two? Converting to and from xts objects seems costly.
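One hybrid I keep wondering about (just a sketch, assuming the `x` table defined above; the 3-day window and `mean` are arbitrary stand-ins) is to keep the master data in a data.table and convert a single firm's series to xts only when an xts-specific tool like `period.apply` is needed, so the conversion stays small:

```r
library(data.table)
library(xts)

# Rebuild the example table from above.
x <- data.table(id  = 1:10,
                dte = rep(seq(from = as.Date("2012-01-01"),
                              to   = as.Date("2012-01-10"), by = "day"),
                          each = 10),
                val = 1:100,
                key = c("id", "dte"))

# Master data stays in data.table; convert one firm on demand.
firm1 <- x[id == 1]                          # fast keyed subset
z <- xts(firm1$val, order.by = firm1$dte)    # small, cheap conversion
period.apply(z, endpoints(z, on = "days", k = 3), mean)  # e.g. 3-day means
```

That way the expensive full-table conversion never happens; only the slice that actually needs xts gets converted.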

Alex
  • Could you clarify what you mean with an example? – Matt Dowle Nov 02 '12 at 21:56
  • @MatthewDowle: sure, let me think of something and I'll post it up! – Alex Nov 02 '12 at 22:19
  • This seems like a general question and it's difficult to answer. For `period.apply` I tend to pass function calls to `by` such as `DT[...,by=month(date)]`. – Matt Dowle Nov 04 '12 at 14:43
  • Yes, but the beauty of `period.apply` is that one can do "every 3 weeks", for example. Is there a way to do this with `by`? I think my goal is to find out whether there is a general methodology people use to work with panel data, or whether it's case by case. – Alex Nov 04 '12 at 15:21
  • With any epoch based date, every 3 weeks is date%/%21. Sometimes I aggregate monthly on yyyymmdd integer dates using %/%100L. Rightly or wrongly I think of 'by' being flexible to take any date / agg / cut functions from base or packages. Maybe it could be easier / built in. – Matt Dowle Nov 04 '12 at 17:52
  • Interesting. This could be a good way to avoid using `xts` for simple things, which I think is slower. – Alex Nov 04 '12 at 19:03
  • @Alex, **xts** is pretty fast as it's mostly written in C. Some things are faster than others, but try to beat the speed of `to.period` or `period.max`. What things in **xts** do you think are slow? – GSee Nov 04 '12 at 20:53
  • Why not have each firm be a column? `XTS <- xts(matrix(1:100, ncol=10), rep(seq(from=as.Date("2012-01-01"), to=as.Date("2012-01-10"), by="day"))); lag(XTS)`. I don't see why you ever need to convert to a `data.frame`. Some models are slower on `xts` than `matrix`, but I don't think `as.matrix` costs you anything (if it does, I think you can just borrow `setattr` from **data.table**). – GSee Nov 04 '12 at 21:28
  • For this simple example, of 10 firms, but with 100,000 dates, I found that lagging the values by date using `lag.xts` is more than 50 times faster. That doesn't include assigning the result or the time to create the (`xts` or `data.table`) objects, both of which `data.table` is likely to do better. – GSee Nov 04 '12 at 22:07
  • Wouldn't it be a good idea to switch to Python+Pandas at that scale, being able to call R functions from it? – Alex Nov 05 '12 at 00:27
  • @GSee: great discussion! One reason I have not switched to `xts` is that in the wide format you suggest, I can only have one value per table. I guess you would suggest having multiple `xts` objects, one per variable I plan to use? – Alex Nov 05 '12 at 03:08
  • @Alex I'm not sure this is a very great discussion. Seems like hand waving to me. This is S.O. where the idea is to have clear questions with clear answers. It specifically _isn't_ for _discussion_, iiuc. Have voted to close (couldn't decide between _likely solicit debate_ or _overly broad_). – Matt Dowle Nov 05 '12 at 10:57
  • @MatthewDowle: I think my question is fairly clear: if one wants to work with panel data, is option 1 or option 2 best? Or is there another? – Alex Nov 05 '12 at 18:13
  • I still think this question is pretty important. Panel data is everywhere in some fields, and R hardly supports it. It would be great to have a discussion about it somewhere. I asked a bunch of specific questions: http://stackoverflow.com/questions/26171958/fill-in-missing-values-by-group-in-data-table, http://stackoverflow.com/questions/25649056/create-lagged-variable-in-unbalanced-panel-data-in-r/26108191#26108191, http://stackoverflow.com/questions/25694940/from-unbalanced-to-balanced-panel/25697841#25697841 – Matthew Oct 21 '14 at 18:09
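To make the `%/%` trick from the comments concrete, here is a minimal sketch (hypothetical toy data; `21L` is 3 weeks in days, and the buckets are aligned to the epoch, not to the first observation):

```r
library(data.table)

# Hypothetical daily series for one firm.
dt <- data.table(dte = seq(as.Date("2012-01-01"), by = "day", length.out = 63),
                 val = 1:63)

# Bucket every 3 weeks via integer division of the epoch-based date,
# then aggregate within each bucket -- no xts conversion needed.
dt[, .(mean_val = mean(val)), by = .(bucket = as.integer(dte) %/% 21L)]
```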

0 Answers