I recently tried to remove adjacent identical rows from a data frame, where "identical" is judged on two variables (Condition1 and Outcome1 below). I have seen this done across all rows, but not for adjacent rows only, so I developed the following three-step workaround (I hope I did not overthink things):

- I lagged the variables on which the matching is based.

- I compared the variables with the lagged variables.

- I deleted all rows in which both were identical (and removed the helper columns that were no longer needed).

```r
# Example data
Case <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
Condition1 <- c(0, 1, 0, 0, 1)
Outcome1 <- c(0, 0, 0, 0, 1)
mwa.df <- data.frame(Case, Condition1, Outcome1)

new.df <- mwa.df
# Step 1: shift each variable up by one row, padding the end with 0
Condition_lag <- c(new.df$Condition1[-1], 0)
Outcome_lag <- c(new.df$Outcome1[-1], 0)
new.df <- cbind(new.df, Condition_lag, Outcome_lag)
# Step 2: flag rows that match the following row on both variables
new.df$Comp <- 0
new.df$Comp[new.df$Outcome1 == new.df$Outcome_lag & new.df$Condition1 == new.df$Condition_lag] <- 1
# Step 3: drop the flagged rows, then the helper columns
new.df <- subset(new.df, Comp == 0)
new.df <- subset(new.df, select = -c(Condition_lag, Outcome_lag, Comp))
```

This worked just fine. But when I wrapped it in a function, because I need to apply it to a large number of data frames, the lag did not work (i.e. the `condition_lag <- c(new.df$condition[-1],0)` and `outcome_lag <- c(new.df$outcome[-1],0)` operations were not carried out). The function code was:

```r
FLC.Dframe <- function(old.df, condition, outcome){
      new.df <- old.df
      condition_lag <- c(new.df$condition[-1],0)
      outcome_lag <- c(new.df$outcome[-1],0)
      new.df <- cbind(new.df, condition_lag, outcome_lag)
      new.df$comp <- 0
      new.df$comp[new.df$outcome == new.df$outcome_lag & new.df$condition == new.df$condition_lag] <- 1
      new.df <- subset(new.df, comp == 0)
      new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
      return(new.df)
}
```

To call the function, I wrote `new.df <- FLC.Dframe(mwa.df, Condition1, Outcome1)`.

Could someone help me with this? Many thanks in advance.

CNiessen
  • Does this answer your question? [Remove/collapse consecutive duplicate values in sequence](https://stackoverflow.com/questions/27482712/remove-collapse-consecutive-duplicate-values-in-sequence) – ekoam Nov 11 '20 at 09:39
  • Thanks for helping. I looked into it and this seems to work when matching based on *one* variable. However, I would need to compare *two (or more)* variables. – CNiessen Nov 11 '20 at 09:58
  • Then check this function `data.table::rleid`, which allows you to specify multiple variables – ekoam Nov 11 '20 at 09:59
  • Many thanks again! I did my very best, but I could not write a code with `rleid()` that would solve the problem. May I ask how you would do it? [Besides, may I ask if you have an idea of why my function above did not work?] – CNiessen Nov 11 '20 at 10:42

1 Answer

Just generate run-length ids and remove the duplicates.

```r
with(mwa.df, mwa.df[!duplicated(data.table::rleid(Condition1, Outcome1)), ])
```

Output

```
    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1
```

If you want a function:

```r
FLC.Dframe <- function(df, cols) df[!duplicated(data.table::rleidv(df[, cols])), ]
```

Call this function like this:

```
> FLC.Dframe(mwa.df, c("Condition1", "Outcome1"))

    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1
```
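To see what the run-length ids look like without installing data.table, here is a minimal base R sketch. `run_ids` is a hypothetical helper written for illustration (it is not part of data.table), and it assumes the key columns contain no `NA` values.

```r
# Example data from the question
mwa.df <- data.frame(
  Case = c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5"),
  Condition1 = c(0, 1, 0, 0, 1),
  Outcome1 = c(0, 0, 0, 0, 1)
)

# Hypothetical helper: one id per run of adjacent rows that agree on `cols`.
# Assumes the columns named in `cols` contain no NA values.
run_ids <- function(df, cols) {
  n <- nrow(df)
  if (n < 2) return(seq_len(n))
  # TRUE where a row differs from the previous row in at least one key column
  changed <- Reduce(`|`, lapply(cols, function(col) df[[col]][-1] != df[[col]][-n]))
  # Every change starts a new run, so the running count is the run id
  cumsum(c(TRUE, changed))
}

ids <- run_ids(mwa.df, c("Condition1", "Outcome1"))
ids                         # 1 2 3 3 4
mwa.df[!duplicated(ids), ]  # first row of each run: Cases 1, 2, 3, 5
```

Cases 3 and 4 share an id because they agree on both columns, so `!duplicated()` keeps only the first of them.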

The main problem with your function is the use of `$`. This operator does not evaluate its right-hand side; it takes the name literally. In `new.df$condition`, for example, `$` looks in `new.df` for a column named "condition", not "Condition1", which is the value of `condition`. Use `[[` instead, which does evaluate its argument. If you rewrite your function as follows, it should work.

```r
FLC.Dframe <- function(old.df, condition, outcome){
  new.df <- old.df
  # `condition` and `outcome` hold column names as strings, so use [[ ]]
  condition_lag <- c(new.df[[condition]][-1],0)
  outcome_lag <- c(new.df[[outcome]][-1],0)
  new.df <- cbind(new.df, condition_lag, outcome_lag)
  new.df$comp <- 0
  # Flag rows that match the following row on both variables
  new.df$comp[new.df[[outcome]] == new.df[["outcome_lag"]] & new.df[[condition]] == new.df[["condition_lag"]]] <- 1
  new.df <- subset(new.df, comp == 0)
  new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
  return(new.df)
}
```
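The difference between `$` and `[[` can also be seen in isolation. A small sketch, where `df` and `condition` are illustrative names only:

```r
df <- data.frame(Condition1 = c(0, 1), Outcome1 = c(0, 0))
condition <- "Condition1"

# `$` looks for a column literally named "condition" and finds none
by_dollar <- df$condition
is.null(by_dollar)      # TRUE

# `[[` evaluates `condition` first, so it looks up "Condition1"
by_bracket <- df[[condition]]
by_bracket              # 0 1
```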

You also need to call it with the column names as character strings:

```
> FLC.Dframe(mwa.df, "Condition1", "Outcome1")

    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
4 Case 4          0        0
5 Case 5          1        1
```
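Note that this output keeps Case 4 where the `rleid` version keeps Case 3. The lag-based function compares each row with the following one, so it keeps the last row of each run rather than the first. Padding the shifted columns with 0 also has a side effect: a final row that is 0 in both columns matches the padding and gets dropped. Below is a minimal sketch that keeps the last row of each run without padding, assuming no `NA` values in the key columns; `keep_last_of_run` and `demo.df` are hypothetical names.

```r
# Hypothetical helper: keep the last row of each run of adjacent rows
# that agree on `cols`. Assumes those columns contain no NA values.
keep_last_of_run <- function(df, cols) {
  n <- nrow(df)
  if (n < 2) return(df)
  # TRUE where a row differs from the following row in at least one column
  differs_from_next <- Reduce(`|`, lapply(cols, function(col) df[[col]][-n] != df[[col]][-1]))
  # The final row has no successor, so it is always kept
  df[c(differs_from_next, TRUE), , drop = FALSE]
}

# Example data from the question
mwa.df <- data.frame(
  Case = c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5"),
  Condition1 = c(0, 1, 0, 0, 1),
  Outcome1 = c(0, 0, 0, 0, 1)
)
res <- keep_last_of_run(mwa.df, c("Condition1", "Outcome1"))
res       # Cases 1, 2, 4, 5

# A data frame whose final row is 0 in both columns
demo.df <- data.frame(Case = c("A", "B", "C"),
                      Condition1 = c(1, 0, 0),
                      Outcome1 = c(0, 0, 0))
res_demo <- keep_last_of_run(demo.df, c("Condition1", "Outcome1"))
res_demo  # Cases A and C; with zero padding, C would be dropped as well
```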
ekoam