
I need to calculate the rolling downside standard deviation (252 days) of a time series with some data points missing.

First I thought of omitting the NAs with na.omit, but since there is a lot of missing data, a 252-row window would sometimes cover way more than an actual year. So I defined my own function, which turns out to be horribly runtime-inefficient.

data and output are xts time series.

My function:

sdDown <- function(x) {
  x <- x[!is.na(x)]  # drop missing values
  x <- x[x < 0]      # keep only negative (downside) observations
  sd(x)
}

The actual calculation:

for (i in 1:ncol(data)) {
  output[, i] <- rollapply(data[, i], 252, sdDown)
}

A small sample of my data:

2018-12-10  -1.20716203625699e-05    0.00040860164054784
2018-12-11  -4.59711501867298e-06   -5.5395807804004e-05
2018-12-12  -2.89544033163936e-06   -2.32695864665396e-05
2018-12-13  -4.6540811777524e-06    -4.09242194254659e-05
2018-12-14  -7.16508767049928e-06   -8.54853569873006e-05
2018-12-17  NA                      -0.000128077030929201
2018-12-18  -3.61378565521999e-06    2.4336464937079e-05
2018-12-19  -7.69458973030011e-06   -6.82301653554002e-05
2018-12-20  -7.90225954459822e-06    NA
2018-12-21  -6.248451592999e-06     -8.99672163515997e-05
2018-12-24  -0.0274338890319172     -0.000100111303817201
2018-12-25  NA                       NA
2018-12-26  NA                       NA
2018-12-27  -0.0210028378713459     -0.0484345012551812
2018-12-28  -4.4543361913201e-06     3.97861322771901e-05
2018-12-31  0.00854680934643593     -5.73227948577303e-05

It does calculate the correct results; however, it is horribly runtime-inefficient. Is there a more efficient way to do this?

Thanks in advance! Please let me know if the question is phrased poorly.

Greetings Max

  • The by.column argument of rollapply defaults to TRUE, so you can omit the for loop and just run rollapply once. Also, you can perform the subsetting before doing the rollapply and then use the fact that width can be a vector; a sketch of this approach follows the comments below. See https://stackoverflow.com/questions/57861197/conditional-rolling-sum-of-events-with-ragged-dates/57861655#57861655 for an example. – G. Grothendieck Sep 12 '19 at 14:14
  • Also, do you know about na.rm=T? That would be one way to deal with NAs. – meh Sep 12 '19 at 14:15
  • From what it sounds like (though sample data would be really helpful), your time series is not of a constant frequency, so any attempt at using `rollapply` is going to run into problems of over- or under-sampling. (I think.) Most solutions I've used that are truly time-range based are typically inefficient due to needing to calculate the size of the window at each step. – r2evans Sep 12 '19 at 14:27
  • That is, not without some effort. Fortunately, `width=` can be a list of widths, so you can define the width per point. Doing this requires calculating distances between all (or at least many/most) points, which might be expensive (large vectors create large^2 matrices), so finding a relatively efficient way to do this is (I think) the key. – r2evans Sep 12 '19 at 14:32
  • @r2evans. See my comment for a better approach that is not quadratic. – G. Grothendieck Sep 12 '19 at 14:37
  • I think it's still quadratic in the calculations if not the memory consumption. That is, if each width will vary based on the actual "time" in the vector, then for each value you need to determine how far to go back/forward in order to have the time range desired. Whether this is done beforehand or on the fly doesn't change the fact that some comparison must be made between each timestamp and all others. Perhaps I misunderstand the variable-width problem and/or your suggestion to subset before rolling? I'd really like to see how it can be done in *O(n)* vice *O(n^2)* or *O(n*log(n))* time. – r2evans Sep 12 '19 at 14:48
  • @r2evans. I assume you didn't read the link I posted. It uses `findInterval` to build the width vector, which is highly efficient. – G. Grothendieck Sep 12 '19 at 14:57
  • Right, my bad, yes that is likely a good approach, so I think it'd be leaning more towards *O(n*log(n))* which is much more manageable than a cruder implementation I had in my head. (The *log(n)* order is, if I understand correctly, because `findInterval` does a good job at finding the best interval, but even with an efficient binary search it still has greater than *O(n)*.) – r2evans Sep 12 '19 at 15:27
  • @G.Grothendieck and r2evans, thank you so much for your help. I'm currently trying to understand the answer you posted in the link, but I'm sure this will solve my problem! Thank you so much! Very kind! – Plsdontjudgeme111 Sep 12 '19 at 15:30
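
Following the comments above, here is a minimal sketch of the suggested approach, assuming an xts object `data` with a Date index: drop the NAs per column first, use `findInterval` to build a per-point width vector so that each trailing window covers 252 calendar days, and make a single `rollapplyr` call per column. The helper name `rollSdDown` is made up for illustration, and this is a sketch of the idea from the linked answer, not tested code:

library(zoo)
library(xts)

rollSdDown <- function(col, days = 252) {
  # use the zoo method, which accepts a vector of widths
  col <- as.zoo(na.omit(col))        # subset before rolling, as suggested above
  d <- as.numeric(index(col))        # observation dates as day counts
  # w[i] = number of observations with date in (d[i] - days, d[i]]
  w <- seq_along(d) - findInterval(d - days, d)
  # per-point trailing widths; windows with fewer than two negative
  # values yield NA
  rollapplyr(col, w, function(x) sd(x[x < 0]))
}

# NA positions differ between columns, so the widths must be computed per
# column; lapply/merge replaces the explicit for loop over columns.
# Convert back with as.xts() if an xts result is needed.
output <- do.call(merge, lapply(seq_len(ncol(data)), function(i) rollSdDown(data[, i])))

Because `findInterval` does a sorted search over the (already sorted) date index, building the width vector avoids the quadratic all-pairs distance computation discussed in the comments.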

0 Answers