
I have a series of daily values, y. For each day d_i (i.e., each row), I would like to calculate the (graph) area, a_i, of the region between the curve and the horizontal line y = y_i, from d_i back to the most recent previous occurrence of the value y_i. Sketch below. Because observations occur at regular, discrete timesteps (daily), the calculated area a_i is equivalent to the sum of the daily differences y_i - y over the intervening days (black bars in the figure). I'm interested only in valleys, so a_i can be set to 0 when y is decreasing (y_i - y_i-1 <= 0).

[Figure: sketch of the curve with the horizontal line y = y_i; the black bars mark the daily differences that sum to a_i]

Toy data below. The expected result is shown in dat$a: dat$a[6] was calculated as 55 - 50;
dat$a[7] as (60 - 55) + (60 - 50). And so on.
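
In other words, if I've stated my own rule correctly, for each row i (with k the most recent prior row where y[k] >= y[i], and a[i] = 0 when y is not increasing or no such k exists):

a[i] = sum(y[i] - y[(k + 1):(i - 1)])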

library(lubridate)  # for as_date()

dat = data.frame(d = seq.Date(as_date("2021-01-01"), as_date("2021-01-10"), by = "1 day"),
                 y = c(100, 95, 90, 70, 50, 55, 60, 75, 85, 90),
                 a = c(0, 0, 0, 0, 0, 5, 15, 65, 115, 145))

My first thought was to calculate the area between the curve and the horizontal line y = y_i between days d_i and the most recent previous occurrence of the value y_i, perhaps using geiger::area.between.curves(), but I couldn't work out how to identify the most recent previous occurrence of the value y_i.

[In case the context helps, the actual data are daily values of the area (m2) of a wetland not submerged by water. When the water rises, a portion of the wetland that had been dry for some time becomes wet. Here, I'm trying to calculate the extent of the reflooding in m2-days. A portion of the wetland that has been dry for a long time but becomes reflooded will contribute many m2-days to the sum.]

I'm most comfortable in the tidyverse, and such answers are greatly preferred. I am not familiar with data.table.

Thanks in advance


Update

I was able to achieve my desired calculation in Excel, though it's brutally inelegant. A couple hundred rows of an example are linked below. Given that my real data are 180k rows, my poor machine hated the 18 million calculated cells. Though I can move on with my analysis, I am still very interested in an R solution. My implemented approach differs subtly from my imagined R approach in that it sums 'horizontal rectangles', so to speak, each of the same (small) y-unit height, rather than 'vertical rectangles', each of unit width.

Here's the file.
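
For reference, here is a rough sketch of the 'vertical rectangles' version of the rule in dplyr/purrr terms. The helper name valley_area() is just mine, and I've assumed a_i = 0 when no prior day has y >= y_i. It reproduces dat$a on the toy data, but the backward search per row is brute force and likely too slow for 180k rows, which is why I'm still after a better R approach.

library(dplyr)
library(purrr)

# For row i: find the most recent prior row k with y[k] >= y[i]; the valley
# area is then the sum of (y[i] - y[j]) over the rows j strictly between k
# and i. Rows where y is not increasing (or with no such prior row) get 0.
valley_area <- function(y) {
  map_dbl(seq_along(y), function(i) {
    if (i == 1 || y[i] <= y[i - 1]) return(0)
    k <- which(y[seq_len(i - 1)] >= y[i])
    if (length(k) == 0) return(0)
    k <- max(k)
    sum(y[i] - y[(k + 1):(i - 1)])
  })
}

dat %>% mutate(a_check = valley_area(y))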

fantanaman
  • Is *d*_2 simply the second date measurement: `dat$d[2]`? – Greg Oct 20 '21 at 19:43
  • for any given date, d: d_2 refers to the most recent prior date when y was equal to y_d. Poor nomenclature, clearly, but I couldn't think of anything better. – fantanaman Oct 20 '21 at 20:46
  • Hmmm...what's to prevent *d*_2 from occurring on the other side of a "hill" rather than a "valley"? Indeed, consider the following two locations on the curve: **(1)** the current location of *d*; and **(2)** the "far side" of the same "hill", just before the graph ends, where it is "level" with location (1). Once *d* advances to location (2), the new *d*_2 will occupy location (1); and the area between them will not be a valley but rather the (local) "peak", just before the end of the graph. – Greg Oct 21 '21 at 13:35
  • Substantially revised question to be more clear (I hope!). @Greg, in the event that the most recent prior occurrence of *y_i* (referred to as *d_2* in the original version of the question) is on the other side of a peak instead of a valley, then the value of *a_i* can be set to 0. This can be checked easily with *y_i* - *y_i-1*; if negative, *y* is falling down from a local peak, so *a_i* = 0. – fantanaman Oct 21 '21 at 16:18
  • Nice job on the clarification! By the way, must *y*_*i* have previously occurred in your dataset? Or might it have fallen on the curve *between* two points? If the latter, how is your curve constructed? Is it composed of line segments between known points? A spline? Some form of smoothed interpolation? If line segments, then all you need to do is search backwards for the first pair of points *P*_0 = (*d*_0, *y*_0) and *P*_1 = (*d*_1, *y*_1) such that *y*_0 <= *y*_*i* <= *y*_1; and then use the linear formula *d*_*k* = *d*_0 + (*y*_*i* - *y*_0)(*d*_1 - *d*_0)/(*y*_1 - *y*_0) ... – Greg Oct 21 '21 at 16:41
  • ...where the point *P*_*k* = (*d*_*k*, *y*_*k* = *y*_*i*) is the previous point of the same "height" as your current point *P*_*i* = (*d*_*i*, *y*_*i*). Once you've used this formula (a simple rearrangement of the linear equation for the line segment) to solve for *d*_*k*, then you have your limits of integration: from *a* = *d*_*k* to *b* = *d*_*i*. – Greg Oct 21 '21 at 16:44
  • *y_i* may not necessarily have occurred previously in the dataset, and may commonly fall between two points. However, the "curve" need only "exist" if mathematically or programmatically convenient, in which case I would use line segments between known points; the real use case is daily data over 10-20 years - I am not necessarily concerned with identifying *d*_*k* to fractions of days. It would be satisfactory to take *d*_*k* as the integer day (or row number, for that matter) *d*_0 from *P*_0 which satisfies *y*_0 <= *y*_*i* <= *y*_1 (as I've done manually in the toy data). – fantanaman Oct 21 '21 at 22:00
  • I suppose I included "curve" in the question because area-between-curves was the general approach that occurred to me, but now that I think about it, it could be convenient to think in terms of rows of a dataframe, rather than *d* as continuous time. I actually need help implementing in R the "search backwards" that you mention (a rough sketch of that search follows these comments). – fantanaman Oct 21 '21 at 22:07
  • I suppose my data could be treated as Riemann rectangles (https://en.wikipedia.org/wiki/Riemann_sum) with unit width in a sense, and the necessary functions are probably in https://stackoverflow.com/questions/4954507/calculate-the-area-under-a-curve and/or https://stackoverflow.com/questions/40080929/how-to-calculate-riemann-sums-in-r, but I don't know how to put the pieces together for my use case. – fantanaman Oct 21 '21 at 22:16
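
A rough sketch of that backward search, using the linear-interpolation step from Greg's comments. The helper name prev_crossing_row() is mine, purely for illustration; it returns the fractional row index where y last equalled y[i] (Greg's d_k), which maps directly to a date since rows are one day apart, or NA when no earlier segment brackets y[i].

# Starting just before row i, walk backwards to the first segment
# (row j -> row j+1) whose y values bracket y[i] on the way down, then
# linearly interpolate where the curve last had the value y[i].
prev_crossing_row <- function(y, i) {
  if (i < 3) return(NA_real_)
  for (j in seq(i - 2, 1)) {
    if (y[j] >= y[i] && y[j + 1] <= y[i]) {
      if (y[j] == y[j + 1]) return(j + 1)    # flat segment exactly at y[i]
      return(j + (y[i] - y[j]) / (y[j + 1] - y[j]))
    }
  }
  NA_real_
}

prev_crossing_row(dat$y, 6)   # 4.75: y last equalled 55 between rows 4 and 5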

1 Answer


Since the question is missing complete information, we will compute the area under the curve assuming that a day is one unit. Modify as appropriate for your specific problem.

library(pracma)

# Close the curve into a polygon: duplicate the first and last rows and drop
# their y values to 0, then take the absolute shoelace area with polyarea().
nr <- nrow(dat)
dat0 <- dat[c(1, 1:nr, nr), ]
dat0[c(1, nrow(dat0)), "y"] <- 0
with(dat0, abs(polyarea(as.numeric(d), y)))
G. Grothendieck
  • Thanks for this. It does calculate the area of the whole valley formed by the full dataset, but does not calculate the incremental area for each day/row. Hopefully my object is clarified in the edited question. – fantanaman Oct 22 '21 at 13:57