
I have a numeric vector and I want to extract runs of decreasing values. In addition, the first value in each sequence should be >= 40, and the last value should be <= 20.

For example:

Mydata = c(1, 5, 0, 10, 40, 30, 25, 20, 7, 34, 23, 55, 70, 42, 38, 22, 44, 33, 11, 17, 25)

The resulting sequences are: c(40, 30, 25, 20, 7) and c(44, 33, 11).

Henrik
Yang Yang
  • Does it need to be efficient and/or idiomatic? If not then you can just do a for loop over the whole list and a while loop starting at each item of the for loop. – Marijn Apr 24 '23 at 19:31
  • Hi @Marijn , it can be either way as long as it works. Could you provide a coding example? Thanks. – Yang Yang Apr 24 '23 at 19:42

5 Answers


A non-idiomatic procedural approach:

Mydata = c(1, 5, 0, 10, 40, 30, 25, 20, 7, 34, 23, 55, 70, 42, 38, 22, 44, 33, 11, 17, 25)
results = list()
# loop over each element of the data vector to check whether it can be the start of a result
for (x in 1:length(Mydata)) {
  if (Mydata[x] >= 40) {
    # start subresult list
    subresult = c(Mydata[x])
    i = 0
    # add elements while decreasing (guard against indexing past the end of the vector)
    while (x+i+1 <= length(Mydata) && Mydata[x+i+1] < Mydata[x+i]) {
      subresult = append(subresult, Mydata[x+i+1])
      i = i + 1
    }
    # store in main result list if last element of subresult <= 20
    if (subresult[length(subresult)] <= 20){
      results[[length(results)+1]] = subresult
    }
  }
}

Result:

> results
[[1]]
[1] 40 30 25 20  7

[[2]]
[1] 44 33 11
Marijn

Using the "standard" way to create a grouping variable based on differences between consecutive values (cumsum(...diff(...)); see Create grouping variable for consecutive sequences and split vector). Check the conditions within each group using tapply, then remove the empty list elements.

x = Mydata
L = tapply(x, cumsum(c(1L, diff(x) > 0)), \(v) if(v[1] >= 40 & tail(v, 1) <= 20) v)
L[lengths(L) != 0]
$`4`
[1] 40 30 25 20  7

$`8`
[1] 44 33 11
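
To make the grouping concrete, this is what that helper variable evaluates to for the example vector; the two decreasing runs of interest fall into groups 4 and 8, which is why those names show up in the result above:

cumsum(c(1L, diff(x) > 0))
# [1]  1  2  2  3  4  4  4  4  4  5  5  6  7  7  7  7  8  8  8  9 10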

Or filter the result from tapply in one go:

Filter(Negate(is.null), tapply(x, cumsum(c(1L, diff(x) > 0)), \(v) if(v[1] >= 40 & tail(v, 1) <= 20) v))

Same logic using data.table:

library(data.table)
data.table(x)[, if(x[1] >= 40 & x[.N] <= 20) x, by = .(g = cumsum(c(1L, diff(x) > 0)))]
       g    V1
   <int> <num>
1:     4    40
2:     4    30
3:     4    25
4:     4    20
5:     4     7
6:     8    44
7:     8    33
8:     8    11
Henrik
library(dplyr)
data.frame(x = Mydata) |>
  filter(lag(x) > x | lead(x) < x) |>
  mutate(id = cumsum(c(0, diff(x)) > 0)) |>
  group_by(id) |>
  filter(first(x) >= 40 & last(x) <= 20) |>
  with(split(x, id)) |>
  unname()
# [[1]]
# [1] 40 30 25 20  7
# 
# [[2]]
# [1] 44 33 11
Gregor Thomas
  • Thanks a lot for your help! Could you please explain the purpose of `filter(lag(x) > x | lead(x) < x)`? – Yang Yang Apr 24 '23 at 20:26
  • That's getting the candidate rows for a decreasing run - `x` can be part of a decreasing sequence either if (a) the value before it is greater than it (`lag(x) > x`) or (b) if the value after it is less than it (`lead(x) < x`). If neither (a) nor (b) are true, then `x` is not part of a decreasing run. – Gregor Thomas Apr 24 '23 at 20:29
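
As a standalone illustration of that candidate filter, here is one way to see which values survive it (a minimal sketch using dplyr's lag(), lead() and coalesce(); the NA comparisons at the two ends become FALSE, which is effectively how filter() treats them):

library(dplyr)
keep = coalesce(lag(Mydata) > Mydata, FALSE) | coalesce(lead(Mydata) < Mydata, FALSE)
Mydata[keep]
# [1]  5  0 40 30 25 20  7 34 23 70 42 38 22 44 33 11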

Using dplyr's consecutive_id to get the grouping, group_split to separate the groups, and a surrounding sapply to extract the groups as vectors:

library(dplyr)

sapply(
  as_tibble(Mydata) %>% 
    mutate(grp = c(F, diff(value) < 0), 
           con = consecutive_id(grp), 
           con = if_else(!grp & lead(con, default=F) != con, con + 1, con)) %>%
    filter(any(grp) & first(value) >= 40 & last(value) <= 20, .by = con) %>%
    group_split(con), "[", 1)
$value
[1] 40 30 25 20  7

$value
[1] 44 33 11
Andre Wildberg

Try this sequence:

step1 <- Mydata[cumsum(Mydata >= 40) > 0]
step2 <- step1[cumsum(step1 != cummin(step1)) < 1]
step2
# [1] 40 30 25 20  7

It'll be up to you to determine whether step2[length(step2)] (aka tail(step2, 1)) is <= 20; if it is, you're good, and if not then there is no path to get there (I think).
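
For completeness, that final check could be sketched as a one-liner (it returns step2 when the last value qualifies, and NULL invisibly otherwise):

# keep step2 only if its last value is <= 20
if (length(step2) > 0 && tail(step2, 1) <= 20) step2
# [1] 40 30 25 20  7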

Walk-through:

  1. Step 1, start with 40:

    Mydata[cumsum(Mydata >= 40) > 0]
    #  [1] 40 30 25 20  7 34 23 55 70 42 38 22 44 33 11 17 25
    step1 <- Mydata[cumsum(Mydata >= 40) > 0]
    
  2. Step 2, we can use cummin (cumulative minimum) along the vector to compute the running minimum at each position:

    cummin(step1)
    #  [1] 40 30 25 20  7  7  7  7  7  7  7  7  7  7  7  7  7
    

    and with this, keep only the leading stretch where the running minimum still equals the actual value.

    cumsum(step1 != cummin(step1)) < 1
    #  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    step1[cumsum(step1 != cummin(step1)) < 1]
    # [1] 40 30 25 20  7
    

    We need the cumsum(.) < 1 step because, if one of the later values happened to equal the running minimum again, a plain equality test (step1 == cummin(step1)) would give an inadvertent match, as in

    step1[11] <- 7
    step1 == cummin(step1)
    #  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    step1[step1 == cummin(step1)]
    # [1] 40 30 25 20  7  7
    

    which is clearly not in the original data.

r2evans