
I have a dataframe that is sorted in descending order by the abc column (and needs to remain that way). Beginning with the first row (i.e., where ID is 1259 and value is 4054), I want to keep the next row that has a lower value in the value column, which would be row 2 (value is 3902). Then, based on the value in row 2 (3902), I would select the next row with a lower value (i.e., 3801 in row 4). Then, based on the value in row 4 (3801), I would select the next row with a lower value (i.e., rows 5-7 with a value of 1314), and so on. I also want to retain selected rows that share a duplicate value (e.g., the 1314 value in rows 5-7).

Data

# A tibble: 42 × 3
      ID   abc value
   <int> <dbl> <dbl>
 1  1259  4530 4054.
 2  1617  4415 3902.
 3  2441  4250 5286.
 4   539  4161 3801.
 5  1471  4012 1314.
 6   687  3934 1314.
 7  2436  3900 1314.
 8  1295  3591 3558.
 9  1738  3493 1188.
10  1747  3366 3902.
# … with 32 more rows

df <- structure(list(ID = c(1259L, 1617L, 2441L, 539L, 1471L, 687L, 
2436L, 1295L, 1738L, 1747L, 672L, 2187L, 2214L, 738L, 1784L, 
726L, 1712L, 2961L, 137L, 1188L, 1824L, 2856L, 953L, 2579L, 2010L, 
484L, 2769L, 1736L, 2994L, 1572L, 586L, 2462L, 2180L, 1404L, 
1472L, 1007L, 3000L, 545L, 2098L, 2029L, 2557L, 1623L), abc = c(4530, 
4415, 4250, 4161, 4012, 3934, 3900, 3591, 3493, 3366, 3248, 3172, 
3172, 3004, 3004, 2939, 2853, 2782, 2541, 2439, 2371, 2116, 2005, 
1999, 1838, 1749, 1611, 1573, 1416, 1301, 1177, 1074, 975, 805, 
729, 680, 538, 494, 309, 297, 155, 0), value = c(4053.69672793727, 
3902.10003271411, 5286.47420784735, 3801.25160651053, 1314.10007572174, 
1314.10007572174, 1314.10007572174, 3557.56126600696, 1188.20000553131, 
3902.10003271411, 1543.90001773834, 1543.90001773834, 1543.90001773834, 
1228.10004138947, 1228.10004138947, 1618.80007553101, 1618.80007553101, 
2787.70004177094, 2877.32907307533, 1289.59995269775, 2640.10003471375, 
1618.80007553101, 3557.56126600696, 3902.10003271411, 1228.10004138947, 
1329.40004730225, 1228.10004138947, 3557.56126600696, 1618.80007553101, 
1206.79998779297, 2117.7806730578, 1206.79998779297, 1206.79998779297, 
3260.02910244849, 1191.90001773834, 1191.90001773834, 1029.70001125336, 
1336.000041008, 1091.09995365143, 1234.89997673035, 1164.4999704361, 
1191.90001773834)), row.names = c(NA, -42L), class = c("tbl_df", 
"tbl", "data.frame"))

Expected Output

     ID   abc value
  <int> <dbl> <dbl>
1  1259  4530 4054.
2  1617  4415 3902.
3   539  4161 3801.
4  1471  4012 1314.
5   687  3934 1314.
6  2436  3900 1314.
7  1738  3493 1188.
8  3000   538 1030.

I know I could do something like this answer to find the next lowest value for each row, but I am unsure how to go through and filter. Since the selection can skip several rows, using `lag` and `lead` did not work for me. I would like to do this with tidyverse, data.table, or the apply family, and I would like to avoid a for loop unless that's the only way. I have a larger dataset, so I am looking for an efficient approach.
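
For reference, a minimal loop-based sketch of the selection logic described above (illustrative only; this is the kind of iteration I am hoping to vectorize away):

keep <- logical(nrow(df))
current_min <- Inf
for (i in seq_len(nrow(df))) {
  if (df$value[i] <= current_min) {  # "<=" so rows tied with the current minimum are retained
    keep[i] <- TRUE
    current_min <- df$value[i]
  }
}
df[keep, ]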

AndrewGB

1 Answer

library(dplyr)

# keep each row whose value equals the running (cumulative) minimum;
# ties with the current minimum (duplicate values) are retained
filter(df, value == cummin(value))


# A tibble: 8 × 3
     ID   abc value
  <int> <dbl> <dbl>
1  1259  4530 4054.
2  1617  4415 3902.
3   539  4161 3801.
4  1471  4012 1314.
5   687  3934 1314.
6  2436  3900 1314.
7  1738  3493 1188.
8  3000   538 1030.
Jon Spring
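
Since the question also mentions data.table, here is a minimal equivalent of the same cumulative-minimum filter (a sketch; note that setDT() converts df to a data.table in place):

library(data.table)
setDT(df)                   # convert the tibble to a data.table by reference
df[value == cummin(value)]  # keep rows at or below the running minimum

The base-R version is analogous: df[df$value == cummin(df$value), ].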
  • Well that's much simpler than I was trying to do it; I think I need to spend more time looking into the cumulative functions! – AndrewGB Sep 10 '22 at 03:24
  • Upvoted for its simplicity and elegance, and this works great with this data. For future readers, realize that if the numbers have fine levels of precision (many relevant decimal places), then this may produce inconsistent/incorrect results, and do so without warning or error. This would be related to floating-point comparisons (in any language, not just R); refs: https://stackoverflow.com/q/9508518/3358272, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f – r2evans Sep 10 '22 at 17:12
  • The more verbose way around this would be to look for a tolerance, something like `filter(df, abs(value - cummin(value)) < 1e-9)`, where the `1e-9` is context-sensitive. Even this might be imperfect if there might be a real value `1e-10` above the `cummin(value)`, meaning it would be incorrectly retained; this is why I say *"context-sensitive"* :-) – r2evans Sep 10 '22 at 17:13
  • Good point. More verbosely, we could use `df %>% mutate(value_rnd = round(value, 2)) %>% filter(value_rnd == cummin(value_rnd))` where we pick an appropriate precision in the `round` function depending on the context. – Jon Spring Sep 10 '22 at 20:15
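
Putting the two comment suggestions together, a minimal sketch of both floating-point-safe variants (both the 1e-9 tolerance and the 2-decimal rounding are context-sensitive choices, as the commenters note):

library(dplyr)

# Tolerance-based: cummin(value) is always <= value, so the difference is non-negative
filter(df, abs(value - cummin(value)) < 1e-9)

# Rounding-based: compare at a chosen precision, then drop the helper column
df %>%
  mutate(value_rnd = round(value, 2)) %>%
  filter(value_rnd == cummin(value_rnd)) %>%
  select(-value_rnd)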