2

I'm new to R, and I was looking for similar questions, but was not able to find one to fix mine, any help would be appreciated.

I have a data frame M:

            date value
1 182-2002-01-01 23.95
2 182-2002-01-02 17.47
3 182-2002-01-03  NA
4 183-2002-01-01  NA
5 183-2002-01-02  5.50
6 183-2002-01-03 17.02

What I need to do is: if there are less than 5 NA (continuously), I will just repeat the previous number(17.47), and if there are more than 5 NA in a row, I will need to delete the whole month.

I tried function rle many times, but didn't work, many thanks for your help.

Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
Rosa
  • 1,793
  • 5
  • 18
  • 23
  • Welcome to Stack Overflow! You will find that you get better answers if you take the time to make your question reproducible. Please follow the guidelines (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), paying special attention to the part about `dput()`. Thanks! – Ari B. Friedman Jul 28 '12 at 02:45

2 Answers2

2

I'm going to adjust your question a little bit for the purposes of demonstration. I'm going to use a similar dataset to you, but for 2 NAs in a row. This generalises to 5 very easily, don't worry. I'm also going to use a data set that better demonstrates the solution

So first, how to get your data to look like what I'm going to use:

library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")), 
               value=M$value)

Now that's out of the road, this is the data I'm going to work with:

library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")), 
               value=M$value)

set.seed(1234)
M2<-expand.grid(ID=182, year=2002:2004, month=1:2, day=1:3, KEEP.OUT.ATTRS=FALSE)
M2 <- M2[with(M2, order(year, month, day, ID)),] #sort the data
M2$value <- sample(c(NA, rnorm(100)), nrow(M2), 
                   prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
M2

    ID year month day      value
1  182 2002     1   1 -0.5012581
7  182 2002     1   2  1.1022975
13 182 2002     1   3         NA
4  182 2002     2   1 -0.1623095
10 182 2002     2   2  1.1022975
16 182 2002     2   3 -1.2519859
2  182 2003     1   1         NA
8  182 2003     1   2         NA
14 182 2003     1   3         NA
5  182 2003     2   1  0.9729168
11 182 2003     2   2  0.9594941
17 182 2003     2   3         NA
3  182 2004     1   1         NA
9  182 2004     1   2 -1.1088896
15 182 2004     1   3  0.9594941
6  182 2004     2   1 -0.4027320
12 182 2004     2   2 -0.0151383
18 182 2004     2   3 -1.0686427

First, we're going to remove all cases where, within a month, there are 2 or more NAs in a row:

NA_run <- function(x, maxlen){
  runs <- rle(is.na(x$value))
  if(any(runs$lengths[runs$values] >= maxlen)) NULL else x
  }

library(plyr)
rem <- ddply(M2, .(ID, year, month), NA_run, 2)
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3         NA
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3         NA
10 182 2004     1   1         NA
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427

You can see that the two in a row NAs have been removed. The one remaining is there because it belongs to two different months. Now we're going to fill in the remaining NAs. The na.rm=FALSE argument is there to keep the NAs if they're right at the beginning (which is what you want, I think).

library(zoo)
rem$value <- na.locf(rem$value, na.rm=FALSE)
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3  1.1022975
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3  0.9594941
10 182 2004     1   1  0.9594941
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427

Now all you need to do to make this 5 or more with your data is to change the value of the maxlen argument in NA_run to 5.

EDIT: Alternatively, if you don't want values to copy over from previous months:

library(zoo)
rem$value <- ddply(rem, .(ID, year, month), summarise, 
                   value=na.locf(value, na.rm=FALSE))$value
rem

    ID year month day      value
1  182 2002     1   1 -0.5012581
2  182 2002     1   2  1.1022975
3  182 2002     1   3  1.1022975
4  182 2002     2   1 -0.1623095
5  182 2002     2   2  1.1022975
6  182 2002     2   3 -1.2519859
7  182 2003     2   1  0.9729168
8  182 2003     2   2  0.9594941
9  182 2003     2   3  0.9594941
10 182 2004     1   1         NA
11 182 2004     1   2 -1.1088896
12 182 2004     1   3  0.9594941
13 182 2004     2   1 -0.4027320
14 182 2004     2   2 -0.0151383
15 182 2004     2   3 -1.0686427
sebastian-c
  • 15,057
  • 3
  • 47
  • 93
  • This works perfectly, thanks so so much! BTW, I have another condition say that if missing days are between 2-4 days, I need to do a linear interpolation, here is what I tried: NA_run <- function(x, maxlen){ runs <- rle(is.na(x$value)) miss <- runs$lengths[runs$values] ifelse(miss >= maxlen, NULL, ifelse((miss > maxlen-3)&(miss < maxlen), approx(x$value, n=length(x$value))$y, x)) } But get an error says:replacement has length zero – Rosa Jul 30 '12 at 21:30
  • That's a different question altogether, but what I'd do is set `na.locf`'s maximum NA fill to 1, then use `rle` to work out the indices of the remaining NA runs. Find the index before and after and the runs and do linear interpolation on those. – sebastian-c Jul 31 '12 at 01:09
  • I was intended to try a simpler example, but seems made it more complex. :p What I need to do is say if there are less than 5 NA in a row, repeat the prior data, if 5 to 10 NA, do linear interpolation, but if more than 10 NA, delete the month. Your codes worked very well for those two parts, but I get confused to the comment, do you mean start all over again or just make change to NA_run? Thank you very much! – Rosa Jul 31 '12 at 02:36
  • I suggest: 1. Use `NA_run` to remove all >10 NAs; 2. Use `na.locf` to fill down <5 in a row; 3. Use `rle` to find the largest runs of NAs. Find where they start and where they end. Extract those, perform linear interpolation, put them back in. – sebastian-c Jul 31 '12 at 04:31
  • Hello, I'm back :( I thought my code worked since there was no error, but when I check the result, I found that:1, I tried rem$value <- na.locf(rem$value, na.rm=FALSE, maxgap=5), it replaced all NA<5, but the problem is when the first day of the month is NA, it repeats the day of the previous month, I would rather do na.locf(rem$value, na.rm=FALSE, maxgap=5, fromLast), so does to the linear interpolation if there are more than 5 NAs within a month, so I started to write another function, but keeps popping out error, please help, it drives me crazy – Rosa Aug 01 '12 at 20:30
  • @Rosa Oh, I actively included that feature because I thought that's what you wanted. I'll edit my answer. – sebastian-c Aug 02 '12 at 01:30
  • THANK YOU SO MUCH!! Now the result turns out to be exactly as what I wanted, thanks again. – Rosa Aug 02 '12 at 03:49
  • Hi, sebastian-c, I came across a new problem, which is similar to this one, only now I have a new column in the data frame called hour(0:23) for each day, what I need to do is: If maxlen >= 10 missing day-hours(7AM-7PM) or >= 5 missing night-hours(7PM-7AM), I will delete the day, otherwise just run linear interpolation. I managed to delete the >=10 hours and interpolate the <=5 hours, but stuck in the 5-10 hours due to the day and night hours, please help if you have any idea, thank you! – Rosa Aug 16 '12 at 21:04
  • Generally, it's a good idea to start new questions for different issues. Comments don't allow you to elaborate well. I suggest you make a separate column indicating whether it is day or night and apply the different methods to each of those subsets. – sebastian-c Aug 17 '12 at 01:00
  • I was not able to figure it out, so I posted a new question: R ifelse condition with hourly data: frequency of continuously NA, appreciate it if you can help, thanks – Rosa Aug 17 '12 at 18:49
0

I'd do this in two steps:

  1. An rle, rollapply, or shift-based strategy to fill in the small gaps (fewer than 5 NAs in a row).
  2. A by, aggregate, or ddply-based strategy to take any month with NAs remaining after step 1 and make the whole month NA.
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235