R ifelse condition: frequency of continuously NA

Question

I'm new to R, and I was looking for similar questions, but was not able to find one to fix mine, any help would be appreciated.

I have a data frame M:

            date value
1 182-2002-01-01 23.95
2 182-2002-01-02 17.47
3 182-2002-01-03  NA
4 183-2002-01-01  NA
5 183-2002-01-02  5.50
6 183-2002-01-03 17.02

What I need to do is: if there are less than 5 NA (continuously), I will just repeat the previous number(17.47), and if there are more than 5 NA in a row, I will need to delete the whole month.

I tried function rle many times, but didn't work, many thanks for your help.

Welcome to Stack Overflow! You will find that you get better answers if you take the time to make your question reproducible. Please follow the guidelines (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), paying special attention to the part about `dput()`. Thanks! — Ari B. Friedman, Jul 28 '12 at 02:45

sebastian-c · Accepted Answer · 2012-08-02T01:42:20.577

I'm going to adjust your question a little bit for the purposes of demonstration. I'm going to use a similar dataset to you, but for 2 NAs in a row. This generalises to 5 very easily, don't worry. I'm also going to use a data set that better demonstrates the solution

So first, how to get your data to look like what I'm going to use:

library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")), 
               value=M$value)

Now that's out of the road, this is the data I'm going to work with:

library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")), 
               value=M$value)

set.seed(1234)
M2<-expand.grid(ID=182, year=2002:2004, month=1:2, day=1:3, KEEP.OUT.ATTRS=FALSE)
M2 <- M2[with(M2, order(year, month, day, ID)),] #sort the data
M2$value <- sample(c(NA, rnorm(100)), nrow(M2), 
                   prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
M2

    ID year month day      value
1  182 2002     1   1 -0.5012581
7  182 2002     1   2  1.1022975
13 182 2002     1   3         NA
4  182 2002     2   1 -0.1623095
10 182 2002     2   2  1.1022975
16 182 2002     2   3 -1.2519859
2  182 2003     1   1         NA
8  182 2003     1   2         NA
14 182 2003     1   3         NA
5  182 2003     2   1  0.9729168
11 182 2003     2   2  0.9594941
17 182 2003     2   3         NA
3  182 2004     1   1         NA
9  182 2004     1   2 -1.1088896
15 182 2004     1   3  0.9594941
6  182 2004     2   1 -0.4027320
12 182 2004     2   2 -0.0151383
18 182 2004     2   3 -1.0686427

First, we're going to remove all cases where, within a month, there are 2 or more NAs in a row:

NA_run <- function(x, maxlen){
  runs <- rle(is.na(x$value))
  if(any(runs$lengths[runs$values] >= maxlen)) NULL else x
  }

library(plyr)
rem <- ddply(M2, .(ID, year, month), NA_run, 2)
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3         NA
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3         NA
10 182 2004     1   1         NA
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427

You can see that the two in a row NAs have been removed. The one remaining is there because it belongs to two different months. Now we're going to fill in the remaining NAs. The na.rm=FALSE argument is there to keep the NAs if they're right at the beginning (which is what you want, I think).

library(zoo)
rem$value <- na.locf(rem$value, na.rm=FALSE)
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3  1.1022975
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3  0.9594941
10 182 2004     1   1  0.9594941
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427

Now all you need to do to make this 5 or more with your data is to change the value of the maxlen argument in NA_run to 5.

EDIT: Alternatively, if you don't want values to copy over from previous months:

library(zoo)
rem$value <- ddply(rem, .(ID, year, month), summarise, 
                   value=na.locf(value, na.rm=FALSE))$value
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3  1.1022975
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3  0.9594941
10 182 2004     1   1         NA
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427

This works perfectly, thanks so so much! BTW, I have another condition say that if missing days are between 2-4 days, I need to do a linear interpolation, here is what I tried: NA_run <- function(x, maxlen){ runs <- rle(is.na(x$value)) miss <- runs$lengths[runs$values] ifelse(miss >= maxlen, NULL, ifelse((miss > maxlen-3)&(miss < maxlen), approx(x$value, n=length(x$value))$y, x)) } But get an error says:replacement has length zero — Rosa, Jul 30 '12 at 21:30
That's a different question altogether, but what I'd do is set `na.locf`'s maximum NA fill to 1, then use `rle` to work out the indices of the remaining NA runs. Find the index before and after and the runs and do linear interpolation on those. — sebastian-c, Jul 31 '12 at 01:09
I was intended to try a simpler example, but seems made it more complex. :p What I need to do is say if there are less than 5 NA in a row, repeat the prior data, if 5 to 10 NA, do linear interpolation, but if more than 10 NA, delete the month. Your codes worked very well for those two parts, but I get confused to the comment, do you mean start all over again or just make change to NA_run? Thank you very much! — Rosa, Jul 31 '12 at 02:36
I suggest: 1. Use `NA_run` to remove all >10 NAs; 2. Use `na.locf` to fill down <5 in a row; 3. Use `rle` to find the largest runs of NAs. Find where they start and where they end. Extract those, perform linear interpolation, put them back in. — sebastian-c, Jul 31 '12 at 04:31
Hello, I'm back :( I thought my code worked since there was no error, but when I check the result, I found that:1, I tried rem$value <- na.locf(rem$value, na.rm=FALSE, maxgap=5), it replaced all NA<5, but the problem is when the first day of the month is NA, it repeats the day of the previous month, I would rather do na.locf(rem$value, na.rm=FALSE, maxgap=5, fromLast), so does to the linear interpolation if there are more than 5 NAs within a month, so I started to write another function, but keeps popping out error, please help, it drives me crazy — Rosa, Aug 01 '12 at 20:30
@Rosa Oh, I actively included that feature because I thought that's what you wanted. I'll edit my answer. — sebastian-c, Aug 02 '12 at 01:30
THANK YOU SO MUCH!! Now the result turns out to be exactly as what I wanted, thanks again. — Rosa, Aug 02 '12 at 03:49
Hi, sebastian-c, I came across a new problem, which is similar to this one, only now I have a new column in the data frame called hour(0:23) for each day, what I need to do is: If maxlen >= 10 missing day-hours(7AM-7PM) or >= 5 missing night-hours(7PM-7AM), I will delete the day, otherwise just run linear interpolation. I managed to delete the >=10 hours and interpolate the <=5 hours, but stuck in the 5-10 hours due to the day and night hours, please help if you have any idea, thank you! — Rosa, Aug 16 '12 at 21:04
Generally, it's a good idea to start new questions for different issues. Comments don't allow you to elaborate well. I suggest you make a separate column indicating whether it is day or night and apply the different methods to each of those subsets. — sebastian-c, Aug 17 '12 at 01:00
I was not able to figure it out, so I posted a new question: R ifelse condition with hourly data: frequency of continuously NA, appreciate it if you can help, thanks — Rosa, Aug 17 '12 at 18:49

score 0 · Answer 2 · answered Jul 28 '12 at 02:52

0

I'd do this in two steps:

An rle, rollapply, or shift-based strategy to fill in the small gaps (fewer than 5 NAs in a row).
A by, aggregate, or ddply-based strategy to take any month with NAs remaining after step 1 and make the whole month NA.

answered Jul 28 '12 at 02:52

Ari B. Friedman

71,271
35
175
235

R ifelse condition: frequency of continuously NA

2 Answers2

Linked