df <- cbind(c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5), c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48), c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8))
colnames(df) <- c("ID","time","value")

I have a dataset as given by the code above. For each ID, starting from the earliest time, I would like to know whether the value bounced up by at least 2 compared to the lowest value before the rise and then fell to or below that lowest pre-bounce value. I would like to flag the time of the increase of at least 2 as the time of the bounce.

So for example, in the above dataset, for ID 1, the lowest value was 0.4 at time 6 before it started rising. At time 18 it met the pre-defined threshold of 2, and then at time 30 it went down to a value equal to the pre-bounce lowest value. So I would like to flag ID 1 as having a bounce and time 18 as the time of the bounce.

On the other hand, for ID 2, although it rose by at least 2 (1.3-->3.6), it never went back to a value below or equal to 1.3, so it should not be flagged.

For ID 3, it again met the criteria for a bounce (0.5-->2.6-->3.7-->1.8-->0.9-->0.3). So I would like to flag ID 3 as having a bounce and time 18 as the time of the bounce.

For ID 4, although there was a rise of at least 2, i.e. from 0.7-->1.6-->1.3-->2.8 (at time 24), it never later went down to or below 0.7, the lowest value before the bounce. So it cannot be flagged as having a bounce.

For ID 5, the values were 2-->1-->3-->0.8, so there was a bounce by at least 2 (1-->3) and then a fall to a value below the lowest pre-bounce value (0.8 <1.0). So this ID should be flagged as having a bounce and the time of bounce should be time 36.

Please help me with this dynamic calculation, and also explain the code if possible so that I can understand the concept. Thank you in advance.

Biostats
  • Would appreciate any help – Biostats Mar 19 '21 at 00:39
  • This seems involved. However, you should start with a single ID and move up from there. See `diff(df$value)`. And `which(diff(df$value) > 2)`. There are a lot of users that would not mind helping out but as of now, it just seems like you are looking for a code writing service. – Cole Mar 19 '21 at 01:19
  • @Cole...thank you for your answer but I am not looking for a code writing service...I want to understand if there is a way this code can be written using dplyr or data.table... – Biostats Mar 19 '21 at 01:23
  • @Cole moreover I have clearly mentioned that I want to understand the logic instead of just copying the codes...I cannot change your interpretation of my question but your comment is pretty judgmental – Biostats Mar 19 '21 at 01:26
  • I meant no judgement. Only that you should try to come up with a solution as it is expected that OP have provided an attempt. Note I gave you some tools and suggestions to move forward with. What have you done to attempt to solve your question? It's good that you want to understand the logic, but this is not a simple question; it's _very_ specific with thresholds and everything. *Edit* note that when asking a question, #2 says "Describe what you've tried" and #3 says "Show some code". – Cole Mar 19 '21 at 01:32

1 Answer

Consider this:

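func <- function(tm, val, threshold = 2) {
  # mtx[r, c] = val[r] - val[c]; keep only later-minus-earlier differences
  mtx <- outer(val, val, `-`)
  mtx[upper.tri(mtx)] <- NA
  # no rise reaches the threshold: return an NA of the same class as tm
  if (all(mtx < threshold, na.rm = TRUE)) return(tm[NA][1])
  ij <- which.max(mtx) # counts through the matrix, along columns
  i <- (ij-1) %/% length(val) + 1 # column: position of the pre-bounce low
  j <- (ij-1) %% length(val) + 1  # row: position of the peak
  # a bounce needs some value after the peak at or below the pre-bounce low
  # (checking after j rather than i avoids a false hit on a tied low before the peak)
  if (any(val[-seq_len(j)] <= val[i])) {
    return(tm[j])
  } else {
    return(tm[NA][1])
  }
}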

df <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5),
  time = c(6,12,18,24,30,3,9,21,6,12,18,24,30,36,6,12,18,24,30,36,12,24,36,48),
  value = c(0.4,1.5,2.7,1.6,0.4,1.3,3.1,3.6,0.5,2.6,3.7,1.8,0.9,0.3,0.7,1.6,1.3,2.8,1.9,1.8,2.0,1.0,3.0,0.8)
)
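To make the logic easier to follow, here is a minimal walk-through of the same steps for a single ID (ID 5), with the fall-back check applied to the observations after the peak. It is only a sketch for illustration; the variable name `fell_back` is made up here and is not part of the function.

```r
# Values for ID 5, copied from the data above
tm  <- c(12, 24, 36, 48)
val <- c(2.0, 1.0, 3.0, 0.8)

# mtx[r, c] is val[r] - val[c]: the change from observation c to observation r
mtx <- outer(val, val, `-`)
# Blank out the upper triangle so only "later minus earlier" differences remain
mtx[upper.tri(mtx)] <- NA
print(mtx)

# Largest rise: linear (column-major) position, then its column and row
ij <- which.max(mtx)               # 7
i  <- (ij - 1) %/% length(val) + 1 # 2, the pre-bounce low (value 1.0)
j  <- (ij - 1) %% length(val) + 1  # 3, the peak (value 3.0)

# Did any later observation fall to or below the pre-bounce low?
fell_back <- any(val[-seq_len(j)] <= val[i]) # TRUE, because 0.8 <= 1.0

# Time of the bounce for this ID
tm[j] # 36
```

The largest entry of the masked matrix is the rise from 1.0 to 3.0, which is why time 36 is reported for ID 5.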

I use `which.max` and the `%/%` and `%%` operators because in general I don't like doing `which(mtx == max(mtx), arr.ind = TRUE)`; while the latter works, it relies on equality tests of floating-point numbers, which can be problematic with extreme values. See "Why are these numbers not equal?", "Is floating point math broken?", and https://en.wikipedia.org/wiki/IEEE_754. If you don't like this safeguarding, feel free to adapt the function to use `which(.)` instead.
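As a small illustration of the index arithmetic described above, the sketch below converts the linear position returned by `which.max` into a row and a column, and compares it with base R's `arrayInd()`, which does the same conversion. The toy matrix `m` is invented for this example only.

```r
# A toy 3 x 3 lower-triangular matrix (filled column by column)
m <- matrix(c(0, -1, 1,
              NA, 0, 2,
              NA, NA, 0), nrow = 3)

ij <- which.max(m)                  # 6: linear, column-major position (NAs skipped)
col_idx <- (ij - 1) %/% nrow(m) + 1 # 2
row_idx <- (ij - 1) %% nrow(m) + 1  # 3

# arrayInd() gives the same answer as a 1 x 2 matrix of (row, column)
rc <- arrayInd(ij, dim(m))

# The kind of floating-point surprise that equality tests can run into
0.1 + 0.2 == 0.3                  # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3)) # TRUE
```

Integer division by the number of rows gives the column, and the remainder gives the row, because R stores matrices column by column.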

The reason I go through the trouble of `tm[NA][1]` is so that the return value is of the exact class as your input time variable. For instance, dplyr in many situations can warn or err if the value you're changing in a vector is not the same class. This warning or error is good, as R's native (and silent) coercion of values can be problematic. For instance, `Sys.time()` is class `POSIXt` but `NA` is not. But `Sys.time()[NA]` is class `POSIXt`. Similarly, integer and numeric both have different types of `NA`. Perhaps this is being a bit over-defensive, but the use of `tm[NA][1]` ensures that the output is always the same class as the input time.
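A quick way to see this for yourself is to compare classes directly. The vectors below (`tm_num`, `tm_int`, `tm_time`) are made-up examples, not part of the original data.

```r
tm_num  <- c(6, 12, 18)
tm_int  <- c(6L, 12L, 18L)
tm_time <- as.POSIXct(c("2021-03-19 00:00:00", "2021-03-20 00:00:00"), tz = "UTC")

class(NA)             # "logical": a bare NA carries no information about the input
class(tm_num[NA][1])  # "numeric"
class(tm_int[NA][1])  # "integer"
class(tm_time[NA][1]) # "POSIXct" "POSIXt"

# Each result is a single missing value that keeps the class of its source vector
is.na(tm_time[NA][1]) # TRUE
```

Indexing with `NA` returns a vector of missing values built from the original vector, so its class and attributes come along; taking `[1]` then reduces it to length one.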

dplyr

library(dplyr)
df %>%
  group_by(ID) %>%
  summarize(time = func(time, value))
# # A tibble: 5 x 2
#      ID  time
# * <dbl> <dbl>
# 1     1    18
# 2     2    NA
# 3     3    18
# 4     4    NA
# 5     5    36

data.table

library(data.table)
DF <- as.data.table(df)
DF[, .(time = func(time, value)), by = .(ID)]
#       ID  time
#    <num> <num>
# 1:     1    18
# 2:     2    NA
# 3:     3    18
# 4:     4    NA
# 5:     5    36
r2evans
  • thank you very very much. I incredibly appreciate your thorough explanation. That is so much helpful. I can see the logic so clearly now. Thanks again..... – Biostats Mar 19 '21 at 18:03