
I've been trying my best, but not quite getting there. I'm trying to iterate through the values in a vector (df$sample) and, for each row, find the first subsequent occurrence of a value that is 20% less than the current value, then print the date of the found value to a new column.

Here's my df:

    date       sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279
...

My shoddy attempts have been to use Position() or which(). I thought maybe I could wrap either of them in a for loop, but my attempts are not quite right.

for (i in seq_len(nrow(df))) {
  df$conc20 <- Position(function(x) x < df$sample[i] * 0.80, df$sample)
}

or

for (i in seq_len(nrow(df))) {
  df$conc20 <- min(which(df$sample < df$sample[i] * 0.8))
}
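For reference, a working base-R version of this loop idea might look like the following (a sketch using the sample data above; conc20 is the target column name from the question, and the 0.8 factor encodes the 20% threshold):

```r
# Sample data from the question
df <- data.frame(
  date = as.Date(c("2020-02-14", "2020-02-15", "2020-02-16", "2020-02-17",
                   "2020-02-18", "2020-02-19", "2020-02-20", "2020-02-21",
                   "2020-02-22", "2020-02-23", "2020-02-24", "2020-02-25",
                   "2020-02-26")),
  sample = c(0.008470, 0.008460, 0.007681, 0.007144, 0.007262, 0.007300,
             0.006604, 0.006843, 0.006687, 0.006991, 0.007333, 0.006738,
             0.006279)
)

# For each row, find the index of the first LATER row whose value is
# below 80% of the current value; NA when no such row exists
idx <- sapply(seq_len(nrow(df)), function(i) {
  hit <- which(df$sample[-seq_len(i)] < 0.8 * df$sample[i])
  if (length(hit)) i + hit[1] else NA_integer_
})
df$conc20 <- df$date[idx]
```

Note that for the first row this finds 2020-02-20 rather than the 2020-02-25 shown in the question's ideal output, since 0.006604 is already below the 0.8 * 0.008470 = 0.006776 cutoff and occurs earlier.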

I even found a dplyr example that got close to what I was looking for.

Ideally:

    date       sample   conc20
591 2020-02-14 0.008470 2020-02-25
590 2020-02-15 0.008460 ...
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
...

Any clarification I'm happy to provide. I really do appreciate the help!

user438383
Bo_0

3 Answers


Edited Answer

df <- read.csv(sep = " ", text = "row date sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279")
df$date <- as.Date(df$date)
df   

# For most rows there is no succeeding row 20% below, so I am just using
# 2% below, multiplying by 0.98 instead of 0.8

# Finding the cross-over before the current row
f_crossover_before <- function(i) {
  cutoff <- 0.98 * df$sample[i]
  res <- max(which(df$sample[1:i] <= cutoff), -1)
  ifelse(res > 0, res, NA)  # sapply cannot return dates!
}

# Finding the cross-over after the current row
f_crossover_after <- function(i) {
  cutoff <- 0.98 * df$sample[i]
  res <- min(i + which(df$sample[(i + 1):nrow(df)] <= cutoff),
             .Machine$integer.max)
  ifelse(res < .Machine$integer.max, res, NA)
}



# A column for comparison; only for visual inspection
df$cutoff <- df$sample * 0.98


df$crossover_before <- sapply(seq_along(df$sample), FUN = f_crossover_before)
df$crossover_before <- df$date[df$crossover_before]

df$crossover_after <- sapply(seq_along(df$sample), FUN = f_crossover_after)
df$crossover_after <- df$date[df$crossover_after]




#View(df)

Output:

#   row       date   sample     cutoff crossover_before crossover_after
#1  591 2020-02-14 0.008470 0.00830060             <NA>      2020-02-16
#2  590 2020-02-15 0.008460 0.00829080             <NA>      2020-02-16
#3  589 2020-02-16 0.007681 0.00752738             <NA>      2020-02-17
#4  588 2020-02-17 0.007144 0.00700112             <NA>      2020-02-20
#5  587 2020-02-18 0.007262 0.00711676             <NA>      2020-02-20
#6  586 2020-02-19 0.007300 0.00715400       2020-02-17      2020-02-20
#7  585 2020-02-20 0.006604 0.00647192             <NA>      2020-02-26
#8  584 2020-02-21 0.006843 0.00670614       2020-02-20      2020-02-22
#9  583 2020-02-22 0.006687 0.00655326             <NA>      2020-02-26
#10 582 2020-02-23 0.006991 0.00685118       2020-02-22      2020-02-25
#11 581 2020-02-24 0.007333 0.00718634       2020-02-23      2020-02-25
#12 580 2020-02-25 0.006738 0.00660324             <NA>      2020-02-26
#13 579 2020-02-26 0.006279 0.00615342             <NA>            <NA>
R.S.
  • What a succinct little function! You're good at interpreting what I'm trying to articulate; I appreciate that. At the moment this is returning all "-1" in my crossover column, which is curious, because for every value in "sample" there is absolutely a succeeding value 2% or even 20% less. But any adjusted value only returns -1. I would like to report only the next succeeding incidence of a value that is less than (e.g.) 0.98*df$sample (next by order of ascending dates). I'll play with it for sure. – Bo_0 Oct 01 '21 at 23:44
  • I tried my best so far, with mixed results. The majority of the returns are still -1, with occasional row numbers being reported from ascending or descending locations. Any thoughts on adapting this to keep sequencing along df$sample until it finds the next succeeding value that is 2% less than the current value, and reporting that row number? – Bo_0 Oct 02 '21 at 08:13
  • @Bo_0 I think I had written the answer before you edited the question to add the clarification. In any case, I'm not sure whether the requirement is to find matching rows before the current date or after it, so I've rewritten the code above to cover both scenarios. – R.S. Oct 02 '21 at 22:02
  • That's great, thank you so much – Bo_0 Oct 03 '21 at 21:01

Quite messy, but this should do the trick:

library(dplyr)
df <- read.csv(sep = " ", text = "row date sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279")

x <- 1.05

df <- df %>%
  mutate(id = 1:n()) %>%
  rowwise() %>%
  mutate(greater_row = first(which(sample * x < df$sample[id:nrow(df)]) + id - 1))
df$greater_row <- df$date[df$greater_row]

This should allow you to set x to any factor you want.

Jarn Schöber
  • This is great. I really do appreciate this approach. Currently this is returning the first date of ascending or descending origin. Is there a way to adapt this to report only the first succeeding date? I wasn't very clear on that; that's my bad. I'll play with it. – Bo_0 Oct 01 '21 at 23:10
  • That's weird; when I run the code it always returns the first valid date (row) after the "current" one. If you want to change the order, you should simply reorder the data.frame so the dates are ascending... – Jarn Schöber Oct 03 '21 at 12:19
  • Copy that. Thank you for your help! – Bo_0 Oct 03 '21 at 21:01
  • Hi, this was exactly what I was looking for, BUT it seems to be a no-longer-working example. I'm getting `Problem while computing greater_row = first(which(sample * x < df$sample[id:nrow(df)]) + id - 1). greater_row must be size 1, not 0.` on the example data. What could be wrong? Thank you – Bury Mar 16 '23 at 11:01
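The size-0 error reported in the last comment can likely be sidestepped by indexing with `[1]` instead of `first()`: on an empty match, `[1]` yields `NA_integer_` rather than a zero-length vector, so the mutate result is always size 1. A sketch of this adaptation of the answer above (greater_row as in that answer):

```r
library(dplyr)

df <- read.csv(sep = " ", text = "row date sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279")

x <- 1.05

df <- df %>%
  mutate(id = 1:n()) %>%
  rowwise() %>%
  # (which(...) + id - 1)[1] is NA_integer_ when there is no match,
  # so the result is always exactly size 1
  mutate(greater_row = (which(sample * x < df$sample[id:nrow(df)]) + id - 1)[1]) %>%
  ungroup()
df$greater_row_date <- df$date[df$greater_row]
```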

If I understand correctly, this can be solved by a non-equi self join using two helper columns:

library(data.table)
setDT(df)[, rn := .I][, threshold := 0.8 * sample][
  , conc20 := df[df, on = .(rn > rn, sample < threshold), mult = "first", x.date]][
    , c("rn", "threshold") := NULL][]
          date   sample     conc20
 1: 2020-02-14 0.008470 2020-02-20
 2: 2020-02-15 0.008460 2020-02-20
 3: 2020-02-16 0.007681 2020-02-27
 4: 2020-02-17 0.007144 2020-02-27
 5: 2020-02-18 0.007262 2020-02-27
 6: 2020-02-19 0.007300 2020-02-27
 7: 2020-02-20 0.006604       <NA>
 8: 2020-02-21 0.006843 2020-02-27
 9: 2020-02-22 0.006687 2020-02-27
10: 2020-02-23 0.006991 2020-02-27
11: 2020-02-24 0.007333 2020-02-27
12: 2020-02-25 0.006738 2020-02-27
13: 2020-02-26 0.006279       <NA>
14: 2020-02-27 0.005300       <NA>

Explanation

The first condition in the on = clause ensures that only succeeding rows are considered; the second condition looks for sample < threshold, where threshold has been defined beforehand as 80% of sample. The helper column rn contains row numbers (created via the special symbol .I). In addition, mult = "first" tells data.table to pick the first occurrence in case of multiple matches.

The result is appended as additional column conc20 by reference, i.e., without copying the whole dataset. Finally, the two helper columns are removed by reference.

Note that chaining is used.
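For readers less familiar with data.table chaining, the chained expression can equivalently be written as separate steps, each modifying df by reference (a sketch, using the same data as the Data section):

```r
library(data.table)

df <- fread("
i   date       sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279
580 2020-02-27 0.005300
", drop = 1L)

df[, rn := .I]                    # helper: row number
df[, threshold := 0.8 * sample]   # helper: 80% cutoff per row
# non-equi self join: first later row whose sample is below this row's threshold
df[, conc20 := df[df, on = .(rn > rn, sample < threshold), mult = "first", x.date]]
df[, c("rn", "threshold") := NULL]
```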

For demonstration, the result of the non-equi self join including all helper columns can be shown:

setDT(df)[, rn := .I][, threshold := 0.8 * sample][
  df, on = .(rn > rn, sample < threshold), mult = "first"]
          date    sample rn threshold     i.date i.sample
 1: 2020-02-20 0.0067760  1 0.0052832 2020-02-14 0.008470
 2: 2020-02-20 0.0067680  2 0.0052832 2020-02-15 0.008460
 3: 2020-02-27 0.0061448  3 0.0042400 2020-02-16 0.007681
 4: 2020-02-27 0.0057152  4 0.0042400 2020-02-17 0.007144
 5: 2020-02-27 0.0058096  5 0.0042400 2020-02-18 0.007262
 6: 2020-02-27 0.0058400  6 0.0042400 2020-02-19 0.007300
 7:       <NA> 0.0052832  7        NA 2020-02-20 0.006604
 8: 2020-02-27 0.0054744  8 0.0042400 2020-02-21 0.006843
 9: 2020-02-27 0.0053496  9 0.0042400 2020-02-22 0.006687
10: 2020-02-27 0.0055928 10 0.0042400 2020-02-23 0.006991
11: 2020-02-27 0.0058664 11 0.0042400 2020-02-24 0.007333
12: 2020-02-27 0.0053904 12 0.0042400 2020-02-25 0.006738
13:       <NA> 0.0050232 13        NA 2020-02-26 0.006279
14:       <NA> 0.0042400 14        NA 2020-02-27 0.005300

Data

library(data.table)
df <- fread("
i   date       sample
591 2020-02-14 0.008470
590 2020-02-15 0.008460
589 2020-02-16 0.007681
588 2020-02-17 0.007144
587 2020-02-18 0.007262
586 2020-02-19 0.007300
585 2020-02-20 0.006604
584 2020-02-21 0.006843
583 2020-02-22 0.006687
582 2020-02-23 0.006991
581 2020-02-24 0.007333
580 2020-02-25 0.006738
579 2020-02-26 0.006279
580 2020-02-27 0.005300
", drop = 1L)
Uwe
  • That's great. Thank you so much for your help! -and your breakdown. This gives me a new avenue of approach to play with. – Bo_0 Oct 03 '21 at 21:05