
I'm trying to replace NA and zero values recursively. I'm working on time series data where an NA or zero is best replaced with the value from the previous week (measurements are taken every 15 minutes, so 672 steps back). My data contains about two years of 15-minute values, so it is a large set. Few NAs or zeros are expected, and runs of adjacent zeros or NAs longer than 672 are not expected either.

I found this thread (recursive replacement in R) where a recursive approach is shown, and adapted it to my problem.

load[is.na(load)] <- 0
o <- rle(load)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 672]
newload<-inverse.rle(o)

Is this "the best" or an elegant method? And how can I protect my code from errors when a zero value occurs within the first 672 values?

I'm used to MATLAB, where I would do something like:

% Replace NaN with 0
Load(isnan(Load))=0;
% Find zero values
Ind=find(Load==0);
for f=Ind
    if f>672
    fprintf('Replacing index %d with the load 1 week ago\n', f)
    % Replace zero with previous week value
    Load(f)=Load(f-672);
    end
end

As I'm not familiar with R, how would I set up such an if/else loop?
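
A minimal sketch of how the MATLAB loop above might translate to R. It uses a small series with a period of 24 instead of 672 so that the result is easy to check; the name `lag` is introduced here for illustration only.

```r
lag <- 24                          # would be 672 for weekly 15-minute data
load <- rep(1:24, times = 10)
load[c(10, 50:54)] <- 0            # index 10 has no value `lag` steps back
load[112:115] <- NA

load[is.na(load)] <- 0             # treat NA like zero
for (f in which(load == 0)) {      # indices are visited in ascending order
  if (f > lag) {
    # replace the zero with the value one period earlier
    load[f] <- load[f - lag]
  }
  # else: nothing / mean / standard value
}
```

Because the loop walks forward, a value that was itself replaced earlier can serve as the source for a later replacement. The zero at index 10 is left unchanged, since nothing exists 24 steps before it.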

A reproducible example (I changed the code, because the example from the other thread didn't cope with adjacent zeros):

day<-1:24
load<-rep(day, times=10)
load[50:54]<-0
load[112:115]<-NA
load[is.na(load)] <- 0
load[load==0]<-load[which(load == 0) - 24]

This gives the original load vector without zeros and NAs. When a zero exists in the first 24 values, this goes wrong because there is no value to replace it with:

loadtest[c(10,50:54)]<-0 # instead of load[50:54]<-0 gives:

Error in loadtest[which(loadtest == 0) - 24] : 
only 0's may be mixed with negative subscripts

An if/else statement could work around this, but I don't know how to apply it. Something like:

day<-1:24
loadtest<-rep(day, times=10)
loadtest[c(10,50:54)]<-0
loadtest[112:115]<-NA
loadtest[is.na(loadtest)] <- 0 
if(INDEX(loadtest[loadtest==0])<24) {
     # nothing / mean / standard value
    } else {
      loadtest[loadtest==0]<-loadtest[which(loadtest == 0) - 24]
    } 

Of course, INDEX isn't valid code.

Community
  • If I'm correct, this replaces an NA with the last non-NA, which is not my goal: "Generic function for replacing each NA with the most recent non-NA prior to it." I want it replaced by a recursive value. – Peter Nijhuis Sep 17 '13 at 14:11
  • Oh, my mistake... it's before my coffee! – Justin Sep 17 '13 at 14:15
  • Please provide a [simplified example](http://stackoverflow.com/a/5963610/1412059) (no need for 672 values) and the expected result. – Roland Sep 17 '13 at 14:17
  • `idx <- which(loadtest == 0); idx <- idx[which(idx>24)]; loadtest[idx] <- loadtest[idx-24]` – Wojciech Sobala Sep 17 '13 at 16:56
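
The index filtering suggested in the last comment can be sketched step by step as follows. This is an illustration on the `loadtest` example from the question, not a tested solution for the full data set; `lag` is a name introduced here.

```r
lag <- 24
loadtest <- rep(1:24, times = 10)
loadtest[c(10, 50:54)] <- 0
loadtest[112:115] <- NA
loadtest[is.na(loadtest)] <- 0

idx <- which(loadtest == 0)    # positions that need a replacement
idx <- idx[idx > lag]          # keep only positions with a value `lag` steps back
loadtest[idx] <- loadtest[idx - lag]
```

Dropping the early indices avoids the negative-subscript error. Note that this is a single vectorized pass: if the value `lag` steps back is itself zero, that zero is copied, so a run of zeros longer than `lag` would need the pass to be repeated.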

2 Answers

1

You can use this example:

set.seed(42)

x <- sample(c(0,1,2,3,NA), 100, TRUE)

stepback <- 6

x_old <- x
x_new <- x_old

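repeat{
    # flag the positions that still need a replacement (zero or NA)
    filter <- x_new==0 | is.na(x_new)
    # shift the series by `stepback` (padding the start with NA) and copy the
    # lagged values into the flagged positions only
    x_new[filter] <- c(rep(NA, stepback), head(x_new, -stepback))[filter]
    # stop once a full pass changes nothing
    if(identical(x_old,x_new)) break
    x_old <- x_new
}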

x
x_new

Result:

> x
  [1] NA NA  1 NA  3  2  3  0  3  3  2  3 NA  1  2 NA NA  0  2  2 NA  0 NA NA  0
 [26]  2  1 NA  2 NA  3 NA  1  3  0 NA  0  1 NA  3  1  2  0 NA  2 NA NA  3 NA  3
 [51]  1  1  1  3  0  3  3  0  1  2  3 NA  3  2 NA  0  1 NA  3  1  0  0  1  2  0
 [76]  3  0  1  2  0  2  0  1  3  3  2  1  0  0  1  3  0  1 NA NA  3  1  2  3  3
> x_new
  [1] NA NA  1 NA  3  2  3 NA  3  3  2  3  3  1  2  3  2  3  2  2  2  3  2  3  2
 [26]  2  1  3  2  3  3  2  1  3  2  3  3  1  1  3  1  2  3  1  2  3  1  3  3  3
 [51]  1  1  1  3  3  3  3  1  1  2  3  3  3  2  1  2  1  3  3  1  1  2  1  2  3
 [76]  3  1  1  2  2  2  3  1  3  3  2  1  3  1  1  3  2  1  3  1  3  1  2  3  3

Note that some values are still NA, because there is no prior information to use for them. If your data has sufficient prior information, this will not happen.
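
As a rough illustration, the same idea can be wrapped in a function and applied to the example from the question. The function name `fill_from_lag` and its argument names are hypothetical, chosen here for the sketch.

```r
# Repeatedly replace zeros/NAs with the value `stepback` positions earlier
# until a full pass changes nothing.
fill_from_lag <- function(x, stepback) {
  repeat {
    bad <- is.na(x) | x == 0
    # series shifted right by `stepback`, padded with NA at the start
    shifted <- c(rep(NA, stepback), head(x, -stepback))
    x_new <- x
    x_new[bad] <- shifted[bad]
    if (identical(x_new, x)) return(x)
    x <- x_new
  }
}

loadtest <- rep(1:24, times = 10)
loadtest[c(10, 50:54)] <- 0
loadtest[112:115] <- NA
filled <- fill_from_lag(loadtest, stepback = 24)
```

In this sketch the zero at index 10 ends up as `NA`, because no value exists 24 steps before it; all other zeros and NAs are replaced by the value one period earlier.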

Ferdinand.kraft
  • I was thinking of the core replacement as `xnew[which(is.na(x)|x==0)]<- x[(which(is.na(x)|x==0)-stepback)]` , which is basically the same thing. Dunno which is faster. There's still a risk that some `NA` values will "look back" to `NA` in the first few spots, in which case they can never be replaced. That's a failure of the OP to properly define his initial conditions, tho', not a bug in your solution. – Carl Witthoft Sep 17 '13 at 16:10
  • @CarlWitthoft the problem with indices `(which(is.na(x)|x==0)-stepback)` is that it may have negatives, and this either throws an error (if there are also positive indices) or (worse) silently puts garbage in the answer (if there are only negatives). – Ferdinand.kraft Sep 17 '13 at 16:57
  • True enough. My approach to any recursive reference problem like this is to start with Step One: clean up the boundary conditions. Or add `max(1,which(whateverconditions))` – Carl Witthoft Sep 17 '13 at 17:14
  • @CarlWitthoft `pmax` :-) – Ferdinand.kraft Sep 17 '13 at 18:17
  • Yep, my bad. Off to get more chocolate – Carl Witthoft Sep 17 '13 at 18:33
  • My bad for the incompleteness of the question, I'm new ;). If I understand your solution correctly, you label all values with TRUE or FALSE. Next you set up a vector that represents the replacement values, i.e. the values 6 steps back. It starts with 6 NA values because there are no prior values; the rest are the normal values 6 steps back. This replacement function is meant for high-frequency data, so could labelling the whole set with TRUE/FALSE make the script slow? There aren't many NAs or zeros, and two weeks of adjacent zeros is not expected. I will adjust my Q. – Peter Nijhuis Sep 18 '13 at 08:37
  • @PeterNijhuis, yes, that is what happens. Computing the TRUE/FALSE "labels" can't be avoided, as you need to know which values must be replaced. It is fast and vectorized, and should not be a bottleneck. After that, replacement affects only NA/zero positions. Also, my code is optimized for a scenario with few cases of an observation being replaced with another zero/NA, so the `repeat` loop won't run more than a couple of iterations. – Ferdinand.kraft Sep 18 '13 at 15:28
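
The index pitfall discussed in these comments can be reproduced on a tiny vector. The following is only a sketch; the variable names are made up for illustration.

```r
x <- c(0, 5, 6, 7, 0, 9)
stepback <- 2

# Only negative indices: R silently drops elements instead of failing.
only_neg <- which(x[1:2] == 0) - stepback    # -1
dropped <- x[only_neg]                       # x without its first element

# Mixed positive and negative indices: R raises an error.
mixed <- which(x == 0) - stepback            # -1 and 3
err <- tryCatch(x[mixed], error = function(e) conditionMessage(e))

# Clamping with pmax() keeps every index valid.
clamped <- pmax(1, mixed)                    # 1 and 3
```

Clamping avoids the error, but positions without enough history then look back to `x[1]`, which may itself be zero or NA, so the boundary values still need to be cleaned up separately.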
1

One option would be to wrap your vector into a matrix with 672 rows:

load2 <- matrix(load, nrow=672)

Then apply the last observation carried forward (either from zoo, or the method above, or ...) to each row of the matrix:

load3 <- apply( load2, 1, locf.function )

Then take the resulting matrix back to a vector with the correct length:

load4 <- t(load3)[ seq_along(load) ]
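
A possible end-to-end sketch of these three steps on a small series with a period of 24. `locf_simple` is a hypothetical stand-in for `locf.function` (something like zoo's `na.locf` could be used instead). Two choices below are assumptions of this sketch rather than part of the answer: zeros are converted to NA first so that they are carried over too, and the vector is padded with NA explicitly instead of relying on recycling.

```r
# Minimal last-observation-carried-forward; leading NAs stay NA.
locf_simple <- function(v) {
  # index of the most recent non-NA position at or before each element (0 if none)
  last_ok <- cummax(seq_along(v) * !is.na(v))
  c(NA, v)[last_ok + 1]
}

period <- 24
load <- rep(1:24, times = 10)[1:230]   # length is not a multiple of 24
load[50:54] <- 0
load[112:115] <- NA
load[which(load == 0)] <- NA           # treat zeros as missing

pad <- (-length(load)) %% period       # NAs needed to complete the last column
load2 <- matrix(c(load, rep(NA, pad)), nrow = period)
load3 <- apply(load2, 1, locf_simple)  # one row of load2 = one time slot across periods
load4 <- t(load3)[seq_along(load)]     # back to a vector of the original length
```

Each row of `load2` holds the same time slot from successive periods, so carrying the last observation forward along a row is equivalent to looking back one or more whole periods.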
Greg Snow
  • Why not simply `load4 <- as.vector(t(load3))`? Just a matter of taste? :-) – Ferdinand.kraft Sep 18 '13 at 15:23
  • @Ferdinand.kraft, that will work fine if the length of `load` is a multiple of 672. If it is not, converting to a matrix (`load2`) will recycle some of the values to fill in the last column. They won't mess up the apply step, but if they are kept in `load4`, you will have extra values from the first part of the series added to the end, which could really mess up an analysis. My version strips those off if they exist. – Greg Snow Sep 18 '13 at 18:09
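
The recycling described in this comment can be seen on a toy vector; this is a small sketch with made-up names.

```r
x <- 1:7
# matrix() warns that 7 is not a multiple of 3 and recycles values to fill the last column
m <- suppressWarnings(matrix(x, nrow = 3))
all_values <- as.vector(m)      # 1 2 3 4 5 6 7 1 2 (two recycled values at the end)
trimmed <- m[seq_along(x)]      # 1 2 3 4 5 6 7
```

Indexing with `seq_along(x)` keeps only as many elements as the original vector had, which is what strips the recycled tail.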