Replace consecutive repeat values based on different run lengths in R

Question

Consider the following dataset:

dat<-data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3),
                var1 = c("A","NA","B","A","NA","NA","B","A","NA","NA","NA","C","A","NA","B","A","NA","NA","D","A","NA","NA","B"))

dat

First, I need to fill in all NA's with the values on either side of the NA's, which was successful in dplyr with:

mutate(value = ifelse(is.na(value), paste0(na.locf(value), "-", na.locf(value, fromLast=TRUE)), 
                        value))

This resulted in:

However, I now need to keep some values while return others back to NA, based on the consecutive run lengths of the repeats (grouped by id col). If the consecutive repeat of A-B is longer than 1 then return all values to NA, and if the consecutive repeat of A-C is longer than 2 then return all values to NA, and if the consecutive repeat of A-D is longer than 3 then return all values to NA.

My desired outcome is:

I think this could be accomplished by some combo of group_by(id), then rle() or data.table's rleid(), then conditionally turn the values back to NA based on the value and the run length with case_when (I considered ifelse() but I have many more conditions than provided in the example and have read that case_when would be a better choice), but I have not been able to figure out how to write the precise code to do this. A similar question I've come across is Replace NA with previous value with limit however, it is a much simpler version of what I need to do.

Any suggestions would be much appreciated. I feel like I am close but I need assistance to get me to the desired result.

what is 'value' column in the code – akrun Nov 10 '20 at 22:26 — akrun, Nov 10 '20 at 22:26

score 0 · Answer 1 · answered Nov 10 '20 at 23:58

Firstly, I strongly recommend to use replace_na function.

Then you could use smth like this:

x <- data.frame(matrix(c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                         0, 1, 2, 3, 4, 4, 5, 5, 5), ncol=2))
x %>%
    group_by(X1) %>%
    mutate(X3 = ifelse(X2 == lag(X2, default = "") | X2 == lead(X2, default = ""),
                       X2, NA_integer_)) %>%
    group_by(X1, X3) %>%
    mutate(X4 = n())

Input:

Output:

# A tibble: 9 x 4
# Groups:   X1, X3 [4]
     X1    X2    X3    X4
  <dbl> <dbl> <dbl> <int>
1     1     0    NA     3
2     1     1    NA     3
3     1     2    NA     3
4     2     3    NA     1
5     2     4     4     2
6     2     4     4     2
7     3     5     5     3
8     3     5     5     3
9     3     5     5     3

And then you can make what you want using X4

score 0 · Accepted Answer · answered Nov 11 '20 at 07:09

0

What you could do:

myfun <- function(x){
  y <- rle(x)
  z <- match(y$values, LETTERS)
  ind <- which(is.na(z))
  m <- z[ind + 1] - z[ind - 1] >= y$lengths[ind]
  y$values[ind[m]] <- paste(y$values[ind[m] - 1], y$values[ind[m] + 1], sep = "-")
  inverse.rle(y)
}


transform(dat, var1 = ave(var1, id, FUN = myfun))

   id var1
1   1    A
2   1  A-B
3   1    B
4   1    A
5   1   NA
6   1   NA
7   1    B
8   2    A
9   2   NA
10  2   NA
11  2   NA
12  2    C
13  2    A
14  2  A-B
15  2    B
16  3    A
17  3  A-D
18  3  A-D
19  3    D
20  3    A
21  3   NA
22  3   NA
23  3    B

answered Nov 11 '20 at 07:09

Onyambu

67,392
3
24
53

Sorry if this is something on my end, but the code is returning an error "Error in rle(x) : 'x' must be a vector of an atomic type", any thoughts? – el88 Nov 12 '20 at 16:20
@el88 ensure you pass an atomic vector into the function. Do not pass lists, dataframes etc – Onyambu Nov 12 '20 at 16:23
Thanks @Onyambu. I was also to resolve the error. Since I am new to creating functions, would you mind explaining the calculation that you have done for m? ie. z[ind + 1] - z[ind - 1] >= y$lengths[ind] – el88 Nov 16 '20 at 16:48
@Onyambu I think it would be helpful if you posted an explanation to go with your function - I'm also a bit curious regarding the `m` line. Am I correct in understanding that it is essentially "If the difference between two letters is greater than the run length of `NA`, insert the letter pair"? Why for only those pairs? Thanks ahead of time. – spops Nov 16 '20 at 20:59
@el88 The line states that if you have the difference between two letters eg D-A (4-1) = 3 to be greater than the NA between them, then that is valid NA's and can be replaced by the range A-D. – Onyambu Nov 16 '20 at 21:04
@Onyambu It's still unclear to me how the function is able to differentiate between keeping the pair of A-D values (rows 17 and 18 in the original dataset) but changing the pair of A-B values (rows 21 and 22 in the original dataset) considering the difference between the two letters is the same in both cases. – el88 Nov 17 '20 at 19:29
@el88 It is because D is the fourth letter while B is the 2nd letter. So in a way A-D allows a NAs of length 1,2,3 while A-B only allows NA's of length 1. – Onyambu Nov 17 '20 at 19:51

Replace consecutive repeat values based on different run lengths in R

2 Answers2