I have a data frame and need to check whether one of its columns has stayed unchanged during the last 60 minutes (there is one data point every 10 minutes). I have to flag a data point Ti as a bad data point if Ti and the 5 data points before it (Ti−1 to Ti−5) are all equal. How can I do that?

Gregor Thomas

2 Answers

data.table::rleid is very useful for things like this:

## sample data
d = data.frame(x = c(1, 2,2, 3,3,3,3,3,3,3, 4,4, 2,2,2,2,2,2))

library(dplyr)
d %>% 
  group_by(rleid = data.table::rleid(x)) %>%
  mutate(bad_flag = row_number() > 5)
# # A tibble: 18 x 3
# # Groups:   rleid [5]
#        x rleid bad_flag
#    <dbl> <int> <lgl>   
#  1     1     1 FALSE   
#  2     2     2 FALSE   
#  3     2     2 FALSE   
#  4     3     3 FALSE   
#  5     3     3 FALSE   
#  6     3     3 FALSE   
#  7     3     3 FALSE   
#  8     3     3 FALSE   
#  9     3     3 TRUE    
# 10     3     3 TRUE    
# 11     4     4 FALSE   
# 12     4     4 FALSE   
# 13     2     5 FALSE   
# 14     2     5 FALSE   
# 15     2     5 FALSE   
# 16     2     5 FALSE   
# 17     2     5 FALSE   
# 18     2     5 TRUE    

This does assume that your measurements are not floating point numbers (integers are fine). If they are floating points, you may need to round them to be safe.
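As a minimal sketch of the rounding caveat, the example below uses base R's `rle` in place of `data.table::rleid` so that it runs without extra packages. The sample values and the choice of `digits = 6` are assumptions for illustration; pick a precision that matches your sensor's resolution.

```r
# Hypothetical readings: 0.1 + 0.2 is not exactly equal to 0.3 in floating point
x <- c(0.1 + 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.5)

# Round first so that values differing only by floating-point noise compare equal
# (digits = 6 is an assumed precision, not a recommendation)
x_rounded <- round(x, digits = 6)

# Position of each element within its run of identical consecutive values
runs <- rle(x_rounded)
pos_in_run <- sequence(runs$lengths)

# A point is bad if it is at least the 6th identical value in a row
bad_flag <- pos_in_run > 5
```

Without the rounding step, the first value would form its own run of length one and the sixth point would not be flagged.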

Gregor Thomas

Please read the comments on this post. The commenters make excellent points about this answer.

A straightforward solution might be

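### your df ...
df <- ...

# Return the indexes of the five rows preceding idx
get.previous.five.indexes <- function(idx)
{
     to.subtract <- 1:5

     return(idx - to.subtract)
}

# The first five rows do not have five predecessors, so they are never flagged
df$flag <- FALSE

for (idx in 6:nrow(df))
{
     previous.five.indexes <- get.previous.five.indexes(idx)

     current.value   <- df$my.value[idx]
     previous.values <- df$my.value[previous.five.indexes]

     # all.equal() compares with a numeric tolerance and needs vectors of equal length
     df$flag[idx]    <- isTRUE(all.equal(previous.values, rep(current.value, 5)))
}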

If you want to avoid a for loop, you can also use apply.

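df <- ...

# Return the indexes of the five rows preceding idx
get.previous.five.indexes <- function(idx)
{
     to.subtract <- 1:5

     return(idx - to.subtract)
}


df$idx <- 1:nrow(df)

# apply() converts df to a matrix first, so this assumes all columns are numeric
df$flag <- apply(df,
                 MARGIN = 1,
                 FUN = function(row)
                 {
                      if (row[["idx"]] < 6)
                      {
                           return(FALSE)
                      }

                      previous.five.indexes <- get.previous.five.indexes(row[["idx"]])

                      previous.values <- df$my.value[previous.five.indexes]

                      # rep() makes both arguments of all.equal() the same length
                      return(isTRUE(all.equal(previous.values, rep(row[["my.value"]], 5))))
                 })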

HTH

MacOS
  • A couple comments - I think by subtracting negative numbers you're actually adding - `to.subtract <- 1:5` should be better. And with recycling I think you can skip the `rep_len`, `idx - 1:5` will return the same value if `idx` is length 1 as if you repeat it. Similarly you can use `all(previous.values == current.value)`. Though if you do use `all.equal`, heed the warning in the help file: *"Do not use `all.equal` directly in if expressions—either use `isTRUE(all.equal(....))` or `identical` if appropriate."* – Gregor Thomas Jan 05 '21 at 16:39
  • To add to @GregorThomas's warning: whatever comparison you use, if your inputs are numeric (non-integer), you have to set a "limit" on how different they can be and still be considered the same value. – Carl Witthoft Jan 05 '21 at 19:10
  • @CarlWitthoft: Excellent point! Thank you! How does `data.table::rleid ` handle that? – MacOS Jan 05 '21 at 19:13
  • The `rleid` approach doesn't handle that at all - just added a caveat to my answer pointing that out. `all.equal` *does* handle that well, but you would want to wrap it in `isTRUE` as the help page suggests. (It doesn't return `FALSE` for non-equal inputs, see `all.equal(3, 4)`.) – Gregor Thomas Jan 05 '21 at 19:21
  • But one other code simplification, `if(all(...)) {return(TRUE)} else{return(FALSE)}` is a long way to write `return(all(...))`. – Gregor Thomas Jan 05 '21 at 19:23
  • And this is not a good use of `apply`, which is made for matrices. `sapply` would be a better choice, probably. – Gregor Thomas Jan 05 '21 at 19:26
  • According to the documentation, `sapply` is just a wrapper around `lapply`. – MacOS Jan 05 '21 at 19:29
  • @MacOS the man page says it works like `rle` (probably my all-time favorite tool), which suggests it expects to be working with char or integer values in the first place. Hence GregorT's warnings about `all.equal`. FWIW, my package `cgwtools::approxeq` may be a more "friendly" tool than `all.equal` – Carl Witthoft Jan 05 '21 at 20:13
  • @MacOS yes, `lapply` would also work (though it would return a `list`, `sapply` will automatically simplify to a vector in this case). `sapply` or `unlist(lapply(...))` would be fine choices here. `apply` is worse here, because `apply` is made for matrices. – Gregor Thomas Jan 06 '21 at 14:25
  • [This FAQ](https://stackoverflow.com/a/7141669/903061) is very helpful in understanding the different *apply functions and when they should be used. – Gregor Thomas Jan 06 '21 at 14:26
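
The suggestions in the comments above (`sapply` over row indexes, `idx - 1:5` via recycling, `all(...)` as the return value, and an explicit limit for numeric comparisons) could be combined roughly as follows. This is only a sketch: the helper name `flag.unchanged`, its parameters `n` and `tol`, the absolute-difference test, and the sample data are all assumptions for illustration, not part of either answer.

```r
# Hypothetical helper: TRUE where x[idx] matches all n preceding values within tol
# (tol is an assumed absolute tolerance; set it to suit your data)
flag.unchanged <- function(x, n = 5, tol = 1e-8) {
  sapply(seq_along(x), function(idx) {
    if (idx <= n) {
      return(FALSE)  # not enough preceding points yet
    }
    previous.values <- x[idx - seq_len(n)]
    all(abs(previous.values - x[idx]) <= tol)
  })
}

# Hypothetical sample data; my.value is the column name used in the answer
df <- data.frame(my.value = c(1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4))
df$flag <- flag.unchanged(df$my.value)
```

Here `sapply` iterates over row indexes rather than over the rows of a matrix, so the other columns of `df` are never coerced to a common type.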