
I'm looking for help on a question very closely related to optimized rolling functions on irregular time series with time-based window.

I want to calculate my function (in my example below it is sum, but it could just as easily be min, max, or median...) over the timesteps up to a certain number of minutes ago, but not beyond that number. So if my duration of interest is 8 minutes and I have timesteps of 2, 2, 5, then for the third time I only want to sum the values for the last two numbers (which gives a total time of 7), not including the first, which would give a total time of 9. Basically I want to get as close to my duration as possible without going over.

For context, the values are rainfall data and mindiff is the recording interval. So if I have a 5 minute timestep, the rain could have occurred at any time in those 5 minutes. I want to find only the rain I can guarantee has fallen in the last 8 minutes, so I want to exclude any timestep that starts before 8 minutes ago, because I cannot be sure about it.
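In case the rule is clearer as code, here is a naive loop that spells it out (roll_sum_within is just a throwaway name; this is only to illustrate what I mean, not the vectorised solution I'm after):

roll_sum_within <- function(mindiff, value, dur = 8) {
  # walk backwards from each row, accumulating mindiff, and stop as soon as the
  # accumulated time exceeds `dur` minutes; NA when the row's own step is already too long
  out <- rep(NA_real_, length(value))
  for (i in seq_along(value)) {
    if (mindiff[i] > dur) next
    elapsed <- 0
    total <- 0
    for (j in i:1) {
      elapsed <- elapsed + mindiff[j]
      if (elapsed > dur) break
      total <- total + value[j]
    }
    out[i] <- total
  }
  out
}

With the example data below, roll_sum_within(testdf$mindiff, testdf$Value) reproduces the desiredOut column, including the NA where the step itself is longer than 8 minutes.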

The code below is SO close to what I want (adapted from Uwe's answer to the question mentioned above), but it includes one extra point from before the window unless the intervals add up to exactly my duration. I'd also like it to give NA if the timestep is greater than the duration, but that is very easy to implement afterwards.

Example with desired output at the end:

library(tibble)
library(data.table)


set.seed(1)
testdf <- tibble(mindiff = c(1,2,1,2,5,2,2,2,11,5,3,2,0,1),
                 DateTime = as.POSIXct("2019-01-01 00:00", tz = "Etc/GMT-10") + (cumsum(mindiff) * 60),
                 Value = round(runif(14,0,5),1))


dur8 <- 60 * 8  # 8 minutes, in seconds


testdf[["Total8Min"]] <- setDT(testdf)[, rn := .I][.(ur = rn, ud = DateTime, ld = DateTime - dur8), 
                                              on = .(rn <= ur, DateTime <= ud, DateTime > ld),
                                              sum(as.double(Value)), by = .EACHI]$V1


testdf$desiredOut <- c(1.3, 3.2, 6.1, 10.6, 8.4, 5.5, 9.2, 12.5, NA, 0.3, 1.3, 1.9, 5.3, 7.2)

Created on 2019-08-21 by the reprex package (v0.3.0)

I am a bit of a novice at data.table; I'm much more used to the tidyverse, so I'm happy with a tidyverse approach if someone has one.

To explain further in response to comments: in my desired output on row 6, I only want the output to be 5.5, which is what has fallen in the last 2 minutes (from 00:11 to 00:13) and in the 5 minutes before that (from 00:06 to 00:11). I do not want to include the value in row 4, as that rain could have fallen any time from 00:04 to 00:06, potentially all of it prior to 00:05, which is 8 minutes before my "now" time on row 6.

Sarah
  • Are you sure all your numbers in `desiredOut` are correct? For example, for the row 6 output, if we go 8 minutes back we should add `4.5 + 4.5 + 1`, which is 10, but your desired output only adds `4.5 + 1`. – Ronak Shah Aug 21 '19 at 04:28
  • The point is that I don't want to include the value of row 4 in the output of row 6, because then the total time difference from row 4 to row 6 (2+5+2) is 9 minutes, which is over 8 minutes. I want to exclude a row that is itself 8 minutes back if its time difference means I am going beyond the 8 minutes. – Sarah Aug 21 '19 at 04:40
  • Isn't row 4 just 13 - 6 (or 2 + 5) minutes back from row 6? – Ronak Shah Aug 21 '19 at 04:45
  • Is it possible to dput something that does not depend on `dminutes`? – chinsoon12 Aug 21 '19 at 05:54
  • I have updated in response to @RonakShah to hopefully explain better. – Sarah Aug 21 '19 at 06:01
  • I have changed the example so it's not there, @chinsoon12. – Sarah Aug 21 '19 at 06:04
  • Please upvote the related FR: https://github.com/Rdatatable/data.table/issues/3241 – jangorecki Aug 21 '19 at 15:13

2 Answers


Here is another possible approach using cumsum (DateTime turned out to be a red herring):

setDT(testdf)[, c("rn", "cs", "lagcs") := .(.I, cs <- cumsum(mindiff), cs - mindiff)]
testdf[, do := testdf[.(rn=rn, start=cs-8), on=.(rn<=rn, lagcs>=start),
    sum(Value), by=.EACHI]$V1]

output:

    mindiff Value desiredOut rn cs lagcs   do
 1:       1   1.3        1.3  1  1     0  1.3
 2:       2   1.9        3.2  2  3     1  3.2
 3:       1   2.9        6.1  3  4     3  6.1
 4:       2   4.5       10.6  4  6     4 10.6
 5:       5   1.0        8.4  5 11     6  8.4
 6:       2   4.5        5.5  6 13    11  5.5
 7:       2   4.7        9.2  7 15    13  9.2
 8:       2   3.3       12.5  8 17    15 12.5
 9:      11   3.1         NA  9 28    17   NA
10:       5   0.3        0.3 10 33    28  0.3
11:       3   1.0        1.3 11 36    33  1.3
12:       2   0.9        1.9 12 38    36  1.9
13:       0   3.4        5.3 13 38    38  5.3
14:       1   1.9        7.2 14 39    38  7.2

data:

library(data.table)
set.seed(1)
mindiff = c(1,2,1,2,5,2,2,2,11,5,3,2,0,1)
testdf <- data.table(mindiff = mindiff, Value = round(runif(14,0,5),1))
testdf$desiredOut <- c(1.3,3.2,6.1,10.6,8.4,5.5,9.2,12.5,NA, 0.3,1.3,1.9,5.3,7.2)
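
Since the question mentions that the function could just as easily be min, max, or median, the same self-join can be reused with a different aggregate in j after running the code above; a sketch (do_max is just an illustrative column name):

# same non-equi self-join, different aggregate
testdf[, do_max := testdf[.(rn = rn, start = cs - 8), on = .(rn <= rn, lagcs >= start),
                          max(Value), by = .EACHI]$V1]

median(Value) etc. slot in the same way.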
chinsoon12

I think @chinsoon12 has a better and more scalable approach, but since I had already started working on my solution I'll post it anyway.

library(dplyr)

testdf %>%
  mutate(row = row_number(),
         out = purrr::map2_dbl(DateTime, row, ~ {
           # rows whose DateTime lies within the last 8 minutes, up to the current row
           inds <- which(DateTime > (.x - dur8) & DateTime <= .x & .y >= row)
           # if those steps add up to more than 8 minutes, drop the earliest one
           if (sum(mindiff[inds]) > 8) sum(Value[inds[-1L]]) else sum(Value[inds])
         })) %>%
  select(-row)


# A tibble: 14 x 5
#   mindiff DateTime            Value desiredOut   out
#     <dbl> <dttm>              <dbl>      <dbl> <dbl>
# 1       1 2019-01-01 00:01:00   1.3        1.3   1.3
# 2       2 2019-01-01 00:03:00   1.9        3.2   3.2
# 3       1 2019-01-01 00:04:00   2.9        6.1   6.1
# 4       2 2019-01-01 00:06:00   4.5       10.6  10.6
# 5       5 2019-01-01 00:11:00   1          8.4   8.4
# 6       2 2019-01-01 00:13:00   4.5        5.5   5.5
# 7       2 2019-01-01 00:15:00   4.7        9.2   9.2
# 8       2 2019-01-01 00:17:00   3.3       12.5  12.5
# 9      11 2019-01-01 00:28:00   3.1       NA     0  
#10       5 2019-01-01 00:33:00   0.3        0.3   0.3
#11       3 2019-01-01 00:36:00   1          1.3   1.3
#12       2 2019-01-01 00:38:00   0.9        1.9   1.9
#13       0 2019-01-01 00:38:00   3.4        5.3   5.3
#14       1 2019-01-01 00:39:00   1.9        7.2   7.2

Here, for every DateTime we take the sum of Value that lies within the 8-minute range.
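
If you also want the NA from your desired output on row 9 (where the step itself is longer than 8 minutes) instead of the 0, one option is to blank those rows afterwards; just a sketch, assuming the result of the pipeline above is saved as res:

# res = result of the pipeline above (hypothetical name);
# mindiff is in minutes while dur8 is in seconds, hence the * 60
res <- res %>%
  mutate(out = replace(out, mindiff * 60 > dur8, NA_real_))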

Ronak Shah