I have a data frame (df) as shown below:
+-----+------+--------+--------------------+------+---------+
| ID1 | ID2 | DOC_NO | DATE | COST | CLIENT |
+-----+------+--------+--------------------+------+---------+
| ABC | A123 | 1 | 2021-01-01 0:10:00 | 11 | ABC123 |
| DEF | B456 | 2 | 2021-01-01 0:10:00 | 12 | DEF256 |
| GHI | C789 | 3 | 2021-01-01 0:10:00 | 13 | GHI389 |
| JKL | D890 | 4 | 2021-01-01 0:10:00 | 14 | JKL490 |
| MNO | E012 | 5 | 2021-01-01 0:10:00 | 15 | MNO512 |
| ABC | A123 | 6 | 2021-01-01 0:15:00 | 11 | ABC623 |
| DEF | B456 | 7 | 2021-01-01 0:15:00 | 12 | DEF756 |
| GHI | C789 | 8 | 2021-01-01 0:15:00 | 13 | GHI889 |
| JKL | D890 | 9 | 2021-01-02 0:15:00 | 14 | JKL990 |
| MNO | E012 | 10 | 2021-01-03 0:15:00 | 15 | MNO1012 |
| ABC | A123 | 11 | 2021-01-03 0:20:00 | 10 | GHI890 |
| DEF | B456 | 12 | 2021-01-03 0:20:00 | 11 | JKL991 |
| GHI | C789 | 13 | 2021-01-03 0:20:00 | 12 | MNO1013 |
| JKL | D890 | 14 | 2021-01-03 0:20:00 | 13 | GHI891 |
| MNO | E012 | 15 | 2021-01-03 0:20:00 | 14 | JKL992 |
| ABC | A123 | 16 | 2021-01-03 0:20:00 | 12 | MNO1014 |
| DEF | B456 | 17 | 2021-01-03 0:20:00 | 13 | GHI892 |
| GHI | C789 | 18 | 2021-01-03 0:20:00 | 14 | JKL993 |
| JKL | D890 | 19 | 2021-01-03 0:20:00 | 15 | MNO1015 |
| MNO | E012 | 20 | 2021-01-03 0:20:00 | 16 | GHI893 |
| ABC | A123 | 21 | 2021-01-03 0:25:00 | 11 | ABC124 |
| DEF | B456 | 22 | 2021-01-03 0:25:00 | 12 | DEF257 |
| GHI | C789 | 23 | 2021-01-03 0:25:00 | 13 | GHI390 |
| JKL | D890 | 24 | 2021-01-03 0:25:00 | 14 | JKL491 |
| MNO | E012 | 25 | 2021-01-03 0:25:00 | 15 | MNO513 |
+-----+------+--------+--------------------+------+---------+
I want to group by ID1 and ID2 and arrange the df by DOC_NO and DATE. After that, I want to create a new column, REFERENCE_COST (shown as REF_COST in the table below), which holds the highest COST seen so far in that DATE and DOC_NO order. In other words, if the COST increases with DATE and DOC_NO, the higher COST becomes the new REFERENCE_COST. So the new df would look as follows:
+-----+------+--------+--------------------+------+---------+----------+
| ID1 | ID2 | DOC_NO | DATE | COST | CLIENT | REF_COST |
+-----+------+--------+--------------------+------+---------+----------+
| ABC | A123 | 1 | 2021-01-01 0:10:00 | 11 | ABC123 | 11 |
| DEF | B456 | 2 | 2021-01-01 0:10:00 | 12 | DEF256 | 12 |
| GHI | C789 | 3 | 2021-01-01 0:10:00 | 13 | GHI389 | 13 |
| JKL | D890 | 4 | 2021-01-01 0:10:00 | 14 | JKL490 | 14 |
| MNO | E012 | 5 | 2021-01-01 0:10:00 | 15 | MNO512 | 15 |
| ABC | A123 | 6 | 2021-01-01 0:15:00 | 11 | ABC623 | 11 |
| DEF | B456 | 7 | 2021-01-01 0:15:00 | 12 | DEF756 | 12 |
| GHI | C789 | 8 | 2021-01-01 0:15:00 | 13 | GHI889 | 13 |
| JKL | D890 | 9 | 2021-01-02 0:15:00 | 14 | JKL990 | 14 |
| MNO | E012 | 10 | 2021-01-03 0:15:00 | 15 | MNO1012 | 15 |
| ABC | A123 | 11 | 2021-01-03 0:20:00 | 10 | GHI890 | 11 |
| DEF | B456 | 12 | 2021-01-03 0:20:00 | 11 | JKL991 | 12 |
| GHI | C789 | 13 | 2021-01-03 0:20:00 | 12 | MNO1013 | 13 |
| JKL | D890 | 14 | 2021-01-03 0:20:00 | 13 | GHI891 | 14 |
| MNO | E012 | 15 | 2021-01-03 0:20:00 | 14 | JKL992 | 15 |
| ABC | A123 | 16 | 2021-01-03 0:20:00 | 12 | MNO1014 | 12 |
| DEF | B456 | 17 | 2021-01-03 0:20:00 | 13 | GHI892 | 13 |
| GHI | C789 | 18 | 2021-01-03 0:20:00 | 14 | JKL993 | 14 |
| JKL | D890 | 19 | 2021-01-03 0:20:00 | 15 | MNO1015 | 15 |
| MNO | E012 | 20 | 2021-01-03 0:20:00 | 16 | GHI893 | 16 |
| ABC | A123 | 21 | 2021-01-03 0:25:00 | 11 | ABC124 | 12 |
| DEF | B456 | 22 | 2021-01-03 0:25:00 | 12 | DEF257 | 13 |
| GHI | C789 | 23 | 2021-01-03 0:25:00 | 13 | GHI390 | 14 |
| JKL | D890 | 24 | 2021-01-03 0:25:00 | 14 | JKL491 | 15 |
| MNO | E012 | 25 | 2021-01-03 0:25:00 | 15 | MNO513 | 16 |
+-----+------+--------+--------------------+------+---------+----------+
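Reading the table above, REF_COST looks like the running maximum of COST within each ID1/ID2 group. Here is a minimal sketch under that assumption, using `cummax()` on a hypothetical subset of the data (`df_small`, holding only the ABC and DEF rows, with CLIENT omitted):

```r
library(dplyr)

# Hypothetical subset of the data above: ABC and DEF rows only, no CLIENT.
df_small <- data.frame(
  ID1    = rep(c("ABC", "DEF"), times = 5),
  ID2    = rep(c("A123", "B456"), times = 5),
  DOC_NO = c(1, 2, 6, 7, 11, 12, 16, 17, 21, 22),
  DATE   = as.POSIXct(
    c("2021-01-01 00:10:00", "2021-01-01 00:10:00",
      "2021-01-01 00:15:00", "2021-01-01 00:15:00",
      "2021-01-03 00:20:00", "2021-01-03 00:20:00",
      "2021-01-03 00:20:00", "2021-01-03 00:20:00",
      "2021-01-03 00:25:00", "2021-01-03 00:25:00"),
    tz = "UTC"
  ),
  COST   = c(11, 12, 11, 12, 10, 11, 12, 13, 11, 12)
)

with_ref <- df_small %>%
  group_by(ID1, ID2) %>%
  # Order each group by time, using DOC_NO to break ties on DATE
  arrange(DATE, DOC_NO, .by_group = TRUE) %>%
  # Assumption: REF_COST is the highest COST seen so far in the group
  mutate(REF_COST = cummax(COST)) %>%
  ungroup() %>%
  # Restore the DOC_NO order used in the tables
  arrange(DOC_NO)

print(with_ref)
```

For these rows, the REF_COST column should match the corresponding rows of the table above.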
Now I want to compare the REFERENCE_COST with the COST and keep only the rows where the COST is less than the REFERENCE_COST. I also want to add two new columns, DATE_LAST_REF_COST_MET and CLIENT_LAST_REF_COST_MET, which show the DATE and the CLIENT of the row in which that REFERENCE_COST was last met. So the resulting df would be as follows:
+-----+------+--------+--------------------+------+---------+----------+------------------------+--------------------------+
| ID1 | ID2 | DOC_NO | DATE | COST | CLIENT | REF_COST | DATE_LAST_REF_COST_MET | CLIENT_LAST_REF_COST_MET |
+-----+------+--------+--------------------+------+---------+----------+------------------------+--------------------------+
| ABC | A123 | 11 | 2021-01-03 0:20:00 | 10 | GHI890 | 11 | 2021-01-01 0:15:00 | ABC623 |
| DEF | B456 | 12 | 2021-01-03 0:20:00 | 11 | JKL991 | 12 | 2021-01-01 0:15:00 | DEF756 |
| GHI | C789 | 13 | 2021-01-03 0:20:00 | 12 | MNO1013 | 13 | 2021-01-01 0:15:00 | GHI889 |
| JKL | D890 | 14 | 2021-01-03 0:20:00 | 13 | GHI891 | 14 | 2021-01-02 0:15:00 | JKL990 |
| MNO | E012 | 15 | 2021-01-03 0:20:00 | 14 | JKL992 | 15 | 2021-01-03 0:15:00 | MNO1012 |
| ABC | A123 | 21 | 2021-01-03 0:25:00 | 11 | ABC124 | 12 | 2021-01-03 0:20:00 | MNO1014 |
| DEF | B456 | 22 | 2021-01-03 0:25:00 | 12 | DEF257 | 13 | 2021-01-03 0:20:00 | GHI892 |
| GHI | C789 | 23 | 2021-01-03 0:25:00 | 13 | GHI390 | 14 | 2021-01-03 0:20:00 | JKL993 |
| JKL | D890 | 24 | 2021-01-03 0:25:00 | 14 | JKL491 | 15 | 2021-01-03 0:20:00 | MNO1015 |
| MNO | E012 | 25 | 2021-01-03 0:25:00 | 15 | MNO513 | 16 | 2021-01-03 0:20:00 | GHI893 |
+-----+------+--------+--------------------+------+---------+----------+------------------------+--------------------------+
This is what I have been able to do so far:
library(dplyr)

df %>%
  group_by(ID1, ID2) %>%
  arrange(DATE, DOC_NO, .by_group = TRUE) %>%
  # Change in COST relative to the previous row of the same group
  mutate(diff = COST - lag(COST, default = first(COST))) %>%
  # When COST drops, take the values from the previous row
  mutate(REF_COST = case_when(diff < 0 ~ lag(COST), TRUE ~ diff)) %>%
  mutate(DATE_LAST_REF_COST_MET = case_when(diff < 0 ~ lag(DATE), TRUE ~ DATE)) %>%
  mutate(CLIENT_LAST_REF_COST_MET = case_when(diff < 0 ~ lag(CLIENT), TRUE ~ CLIENT))
The limitation of this approach is that it doesn't update the REFERENCE_COST as DATE and DOC_NO advance while making the calculations.
I'm not sure how to achieve this.
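For reference, one possible way to get the expected output is sketched below. It assumes that REF_COST is the running maximum of COST per group, and that "last met" means the most recent row in the group whose COST equals that running maximum. `df_small` (only the ABC and JKL rows) and the helper column `last_met` are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical subset of the data above: ABC and JKL rows only.
df_small <- data.frame(
  ID1    = rep(c("ABC", "JKL"), each = 5),
  ID2    = rep(c("A123", "D890"), each = 5),
  DOC_NO = c(1, 6, 11, 16, 21, 4, 9, 14, 19, 24),
  DATE   = as.POSIXct(
    c("2021-01-01 00:10:00", "2021-01-01 00:15:00", "2021-01-03 00:20:00",
      "2021-01-03 00:20:00", "2021-01-03 00:25:00",
      "2021-01-01 00:10:00", "2021-01-02 00:15:00", "2021-01-03 00:20:00",
      "2021-01-03 00:20:00", "2021-01-03 00:25:00"),
    tz = "UTC"
  ),
  COST   = c(11, 11, 10, 12, 11, 14, 14, 13, 15, 14),
  CLIENT = c("ABC123", "ABC623", "GHI890", "MNO1014", "ABC124",
             "JKL490", "JKL990", "GHI891", "MNO1015", "JKL491"),
  stringsAsFactors = FALSE
)

result <- df_small %>%
  group_by(ID1, ID2) %>%
  arrange(DATE, DOC_NO, .by_group = TRUE) %>%
  mutate(
    # Assumption: the reference cost is the highest COST seen so far
    REF_COST = cummax(COST),
    # Within-group position of the most recent row whose COST met REF_COST;
    # the first row of a group always meets it, so this is never 0
    last_met = cummax(if_else(COST >= REF_COST, row_number(), 0L)),
    DATE_LAST_REF_COST_MET   = DATE[last_met],
    CLIENT_LAST_REF_COST_MET = CLIENT[last_met]
  ) %>%
  # Keep only rows whose COST fell below the reference cost
  filter(COST < REF_COST) %>%
  ungroup() %>%
  select(-last_met) %>%
  arrange(DOC_NO)

print(result)
```

Using `cummax()` instead of `lag()` lets the reference carry forward across any number of rows, instead of comparing each row only with the one before it. If tidyr is available, filling the DATE and CLIENT values downward with `tidyr::fill()` would be an alternative to the index trick.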