
I have a large data frame (df) with binary values that are either 1 or 2. NAs are also included in the data. As a practical example, I will create a short vector containing a subset of one user's data:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

What I would like is a function that finds the first and the second 2 of a vector and sets everything within this interval to 2. However, if the positions of the second and the first 2 differ by more than 3, the values stay as they are and no change is made.

In addition, the function has to be applied along the whole of df, not only to the first pair of 2s. For example, considering again the case of:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

The function should have this outcome:

df_outcome <- c(NA,NA,2,1,1,1,2,2,2,2,1,1,1,NA,2,2,2,2,1,1,1,2)

Notice that in df_outcome the values between the very first and second 2s were not changed, as the difference in their positions was > 3. The other short runs of non-2 values, on the other hand, were changed accordingly.

What I have tried to do (but it does not work properly):

With the help of rollapply in the zoo package, I have tried to create a function that finds the first and the second 2 of the array and performs the modifications as described above.

func <- function (q) {
  for (i in (which(q %in% 2)[1]):(which(q %in% 2)[2])) {
    q[i]<-2
  }
  return(q)
}

Then I nested it in rollapply so that I can specify a width for each cycle, plus other arguments such as the alignment of the result index (left).

df_outcome<-rollapply(df, width = 3, FUN = func, fill = NA, partial = TRUE, align = "left")

The problem is that the user-defined function works when applied to a vector directly. When passed as an argument to rollapply, however, it returns an error:

Error in (which(q %in% 2)[1]):(which(q %in% 2)[2]) : NA/NaN argument
Called from: FUN(data[replace(posns, !ix, 0)], ...)
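A likely cause, sketched in base R only: with width = 3, many windows contain fewer than two 2s, so the second index returned by which() is NA and the `:` operator fails. The snippet below builds the first window of the example vector by hand. It does not call rollapply itself, and the names q, pos and res are only illustrative.

```r
# First width-3 window of the example vector (built by hand)
q <- c(NA, NA, 2)

pos <- which(q %in% 2)
pos[1]   # position of the first 2 in the window
pos[2]   # NA: the window holds only one 2

# 3:NA is not a valid sequence, so the loop bounds cannot be built
res <- tryCatch(pos[1]:pos[2], error = function(e) conditionMessage(e))
res      # text of the error raised by `:`
```

Any function used this way would need to handle windows with zero or one 2 before building the sequence.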

I guess there is some mistake on my part in the use of rollapply, or perhaps in the format of the data, but I cannot work out what the issue is. I thought about using rollapply because my data is very long and is generated for different users. Hence, I need a function that can also split the data by other variables such as User_ID (much like the .variables argument in ddply or by in data.table).

Looking forward to your support.

Jaap
IlBardo

3 Answers


A solution with rle:

rldf <- rle(df)
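# value of the following run and of the preceding run
rlnext <- c(tail(rldf$values,-1), NA)
rlprev <- c(NA, head(rldf$values,-1))

# runs of fewer than three 1s enclosed by 2s on both sides become 2
rldf$values[which(rldf$values == 1 & rlprev == 2 & rlnext == 2 & rldf$lengths < 3)] <- 2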

df_out <- inverse.rle(rldf)

which gives:

> df_out
 [1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2

> identical(df_outcome,df_out)
[1] TRUE
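To apply this per user, as asked in the question, one option is to wrap the rle steps in a function and call it group-wise with base R's ave(). This is only a minimal sketch: the function name fill_short_gaps, its max_gap argument, and the columns User_ID and val are assumptions made for illustration.

```r
# Hypothetical helper: turn runs of at most `max_gap` 1s into 2s
# when the run is enclosed by 2s on both sides.
fill_short_gaps <- function(x, max_gap = 2) {
  r <- rle(x)
  prev_val <- c(NA, head(r$values, -1))   # value of the preceding run
  next_val <- c(tail(r$values, -1), NA)   # value of the following run
  hit <- which(r$values == 1 & prev_val == 2 & next_val == 2 &
                 r$lengths <= max_gap)
  r$values[hit] <- 2
  inverse.rle(r)
}

# Made-up data for two users (assumed column names)
d <- data.frame(
  User_ID = rep(c("A", "B"), times = c(7, 6)),
  val     = c(2, 1, 2, 1, 1, 1, 2,   1, 2, 1, 1, 2, NA)
)

# ave() applies the function within each User_ID separately
d$val_filled <- ave(d$val, d$User_ID, FUN = fill_short_gaps)
d$val_filled
# [1]  2  2  2  1  1  1  2  1  2  2  2  2 NA
```

Because the groups are processed separately, the leading 1 of user B stays 1 even though user A's last value is a 2.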
Jaap
    This solution to the problem works very well for my set of data and taught me about the use of rle! Thanks for the support. – IlBardo Oct 31 '16 at 08:54

You can get the indices of the 2s in df, then take the differences between those positions to find the indices of the values to replace with 2:

# position of the 2s
pos_df_2 <- which(df==2) 
# which differences between consecutive positions are at most 3
wh_pos2_inf3 <- which(c(FALSE, diff(pos_df_2)<=3))
# get all indices between positions that are at most 3 apart
ind_to_replace <- unique(unlist(lapply(wh_pos2_inf3, function(x) {pos_df_2[x-1]:pos_df_2[x]})))
# replace the elements by 2
df[ind_to_replace] <- 2 
df
#[1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2
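To make the steps easier to follow, here are the intermediate values for the example vector from the question. This is just a walk-through of the same idea, starting again from the unmodified df.

```r
# Example vector from the question (unmodified)
df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

pos_df_2 <- which(df == 2)   # which() skips the NAs
pos_df_2
# [1]  3  7  9 10 15 16 18 22

gaps <- diff(pos_df_2)       # distance between consecutive 2s
gaps
# [1] 4 2 1 5 1 2 4

# index (within pos_df_2) of each 2 that is at most 3 after the previous 2
wh_pos2_inf3 <- which(c(FALSE, gaps <= 3))
wh_pos2_inf3
# [1] 3 4 6 7
```

The selected pairs are therefore (7, 9), (9, 10), (15, 16) and (16, 18), so positions 7 to 10 and 15 to 18 end up as 2.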
Cath
  • Thanks for the input. I actually thought about the use of indices and your solution provides a clear answer from this perspective. Thanks a lot. – IlBardo Oct 31 '16 at 08:56

Using data.table (though there may well be a better solution):

library(data.table)
df<-c(NA, NA, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, NA, 2, 2, 1, 2, 1, 1, 1, 2)
dt<-data.table(val=df)
dt[,`:=`(id=rleid(val), p=shift(val,type="lag"), n=shift(val,type="lead"))]
dt[,`:=`(s=.N, f=p[1], e=n[.N]), by=id]
dt[f==2 & e==2 & s<3, val:=2]

In detail:

Create df with a small difference, so that the test data has two consecutive 1s between two 2s:

df<-c(NA, NA, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, NA, 2, 2, 1, 2, 1, 
  1, 1, 2)
dt<-data.table(val=df)

Create a run-length id (rleid) of val, plus the lagged and leading values of val (for the next step):

dt[,`:=`(id=rleid(val), p=shift(val, type="lag"), n=shift(val, type="lead"))]

Then, by id (group), get the size of the group and the previous and next value for this group:

dt[,`:=`(s=.N, f=p[1], e=n[.N]), by=id]

Now filter per your requirements (previous and next value are 2, size less than 3) and set val to 2:

dt[f==2 & e==2 & s<3, val:=2]

Which gives in the end:

dt[,val]
[1] NA NA  2  2  2  2  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2

Compared to source:

[1] NA NA  2  1  1  2  2  1  2  2  1  1  1 NA  2  2  1  2  1  1  1  2   

It seems to give what you're expecting.
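For the per-user case mentioned in the question, the same steps can probably be grouped. The sketch below assumes a User_ID column (a made-up name, as is the test data) and includes it in rleid() and in by, so that runs and lag/lead values never cross from one user to the next.

```r
library(data.table)

# Made-up data for two users (assumed column names)
dt <- data.table(
  User_ID = rep(c("A", "B"), times = c(7, 6)),
  val     = c(2, 1, 2, 1, 1, 1, 2,   1, 2, 1, 1, 2, NA)
)

# run id that restarts whenever the user or the value changes
dt[, id := rleid(User_ID, val)]
# previous and next value, computed within each user only
dt[, `:=`(p = shift(val, type = "lag"), n = shift(val, type = "lead")), by = User_ID]
# size of each run, value before it and value after it
dt[, `:=`(s = .N, f = p[1], e = n[.N]), by = id]
# short runs enclosed by 2s become 2
dt[f == 2 & e == 2 & s < 3, val := 2]

dt[, val]
# [1]  2  2  2  1  1  1  2  1  2  2  2  2 NA
```

The first 1 of user B is left alone because its previous value is NA within that user's group.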

Tensibai
    It took me some time to understand this but it does wonders - thanks. I do not have much experience with data.table so I am looking forward to get into it a bit more. Thanks for the thorough explanation. – IlBardo Oct 31 '16 at 08:59