0

I am working with a dataset which has some errors in the data. Numbers are sometimes registered wrong. Here is some toy data example:

Fake data

The issue is that the Reversal column should only be counting up (per unique ID). So in a vector of 0,0,0,1,1,1,0,1,2,2,0,0,2,3, the 0's following the 1 and 2 should not be 0's. Instead, they should be equal to whatever value came before. I tried to remedy this by using the lag function from the dplyr package:

Data$Reversal <- ifelse(Data$Reversal < lag(Data$Reversal), lag(Data$Reversal), Data$Reversal) . But this results in numerous issues:

  1. The first value becomes NA. I've tried using the default=Data$Reversal call in the lag function but to no avail.
  2. The Reversal value should reset to 0 for each Unique ID. Now it continues across ID's. I tried a messy code using group_by(ID) but could not get this to work, as it broke my earlier ifelse function.
  3. This only works when there is 1 error. But if there are two errors in a row it only fixes 1 value.

Alternatively, I found this thread in which the answer provided by Andrie also seems promising. This fixes problem 1 and 3, but I can't get this code to work per ID (using the group_by function).

Andrie's answer:


local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x==0) # index positions of zeroes
  xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
  r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
  inverse.rle(r)
})

Any help would be much appreciated.

R. Iersel
  • 89
  • 9
  • Could you confirm that error values in second column are `B` and `C` values?? – Duck Sep 23 '20 at 14:04
  • @Duck Yes they are indeed B and C values. Owner values only change once a Success is reached. @27ϕ9 Amazing! It seems to be working (I used your first suggestion). I still need to fully understand what the code does (I never used `with` before). Thanks a bunch. Please reply as an answer so I can accept it! I would also much appreciate if you could include a short explanation of how `with` works with the `ave` function, though I think I get the gist of it. – R. Iersel Sep 23 '20 at 14:28

1 Answers1

2

I think cummax does exactly what you need.

Base R

dat$Reversal <- ave(dat$Reversal, dat$ID, FUN = cummax)
dat
#    ID Owner Reversal Success
# 1   1     A        0       0
# 2   1     A        0       0
# 3   1     A        0       0
# 4   1     B        1       1
# 5   1     B        1       0
# 6   1     B        1       0
# 7   1 error        1       0
# 8   1 error        1       0
# 9   1     B        1       0
# 10  1     B        1       0
# 11  1     C        1       1
# 12  1     C        2       0
# 13  1 error        2       0
# 14  1     C        2       0
# 15  1     C        3       1
# 16  2     J        0       0
# 17  2     J        0       0

dplyr

dat %>%
  group_by(ID) %>%
  mutate(Reversal = cummax(Reversal)) %>%
  ungroup()

data.table

as.data.table(dat)[, Reversal := cummax(Reversal), by = .(ID)][]

Data, courtesy of https://extracttable.com/

dat <- read.table(header = TRUE, text = "
ID  Owner   Reversal    Success
1   A   0   0
1   A   0   0
1   A   0   0
1   B   1   1
1   B   1   0
1   B   1   0
1   error   0   0
1   error   0   0
1   B   1   0
1   B   1   0
1   C   1   1
1   C   2   0
1   error   0   0
1   C   2   0
1   C   3   1
2   J   0   0
2   J   0   0")
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Yes! This works, thanks a lot! I'm a bit surprised because @27ϕ9 replied with a similar answer though removed his/her comment. Though now having seen both your answers how it all works is also clear to me. Thank you! – R. Iersel Sep 23 '20 at 14:35
  • 1
    @27ϕ9 posted their comment while I was mid-swing on the answer. Glad it helped. – r2evans Sep 23 '20 at 14:38