I am looking to make rolling counts, for multiple sites, of the instances in which a threshold is exceeded.

A simplified version of my data:

        Dates SiteID Value
1  2015-04-01      A   9.1
2  2015-04-02      A   8.8
3  2015-04-02      A   7.9
4  2015-04-03      A   9.2
5  2015-04-08      A   9.3
6  2015-04-11      A   8.9
7  2015-04-11      A   9.2
8  2015-04-13      A   9.1
9  2015-04-16      A   7.8
10 2015-04-01      B   9.1
11 2015-04-01      B   8.8
12 2015-04-04      B   9.9
13 2015-04-05      B   7.8
14 2015-04-06      B   9.8
15 2015-04-06      B   9.2
16 2015-04-07      B   9.1
17 2015-04-08      B   8.5
18 2015-04-15      B   9.1

If the rolling period is 3 days and the threshold for 'Value' is 9, I am looking for a new column, 'Exceedances', that counts the number of times 'Value' was greater than 9 in the last 3 days at a given 'SiteID'. So this would look like:

        Dates SiteID Value Exceedances
1  2015-04-01      A   9.1           1
2  2015-04-02      A   8.8           1
3  2015-04-02      A   7.9           1
4  2015-04-03      A   9.2           2
5  2015-04-08      A   9.3           1
6  2015-04-11      A   8.9           0
7  2015-04-11      A   9.2           1
8  2015-04-13      A   9.1           2
9  2015-04-16      A   7.8           0
10 2015-04-01      B   9.1           1
11 2015-04-01      B   8.8           1
12 2015-04-04      B   9.9           1
13 2015-04-05      B   7.8           1
14 2015-04-06      B   9.8           2
15 2015-04-06      B   9.2           3
16 2015-04-07      B   9.1           3
17 2015-04-08      B   8.5           3
18 2015-04-15      B   9.1           1

DT = structure(list(r = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), Dates = structure(c(16526, 16527, 
16527, 16528, 16533, 16536, 16536, 16538, 16541, 16526, 16526, 
16529, 16530, 16531, 16531, 16532, 16533, 16540), class = "Date"), 
    SiteID = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B", 
    "B", "B", "B", "B", "B", "B", "B", "B"), Value = c(9.1, 8.8, 
    7.9, 9.2, 9.3, 8.9, 9.2, 9.1, 7.8, 9.1, 8.8, 9.9, 7.8, 9.8, 
    9.2, 9.1, 8.5, 9.1), Exceedances = c(1L, 1L, 1L, 2L, 1L, 
    0L, 1L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L)), .Names = c("r", 
"Dates", "SiteID", "Value", "Exceedances"), row.names = c(NA, 
-18L), class = "data.frame")

I have seen similar questions that use data.table and dplyr, but none have addressed counting exceedances of thresholds.

Ultimately this will be applied to very large datasets, so the fastest methods are appreciated. In case it makes a difference to recommendations: I will also be applying this over a rolling year rather than the 3-day window above, and the dataset will contain NAs.

Mark H
  • Rows 5 and 18 conflict: both have a Value > 9 and no obs for three days preceding but one gets result of 1 while the other gets 0...? Btw, you should probably actually show an example with NAs if you expect answers to address that case. – Frank Feb 02 '17 at 18:11
  • Does the order of the "Dates" matter? Because, for both rows 6 and 7, the previous 3 days, there is 1 value "> 9". Why is this not counted for row 6? – alexis_laz Feb 02 '17 at 19:42
  • The `structure` part I added to the question is so that it is reproducible. Please maintain it as you make changes or find some other way to keep it reproducible. Some guidance: http://stackoverflow.com/a/28481250/ – Frank Feb 02 '17 at 19:55

4 Answers

Since the row number seems to matter, we can add it as a column:

library(data.table)
setDT(DT)

DT[, r := rowid(SiteID)]
setcolorder(DT, c("r", setdiff(names(DT), "r")))

Then you can do a non-equi join to count rows meeting the criterion:

DT[, v := 
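  # self-join: for each row, look up rows at the same site at or before this
  # row (r <= rtop) whose Dates fall in (Dates - 3, Dates], then count Value > 9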
  DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 3, d1 = Dates), 
    on=.(SiteID, r <= rtop, Dates > d0, Dates <= d1), 
    sum(Value > 9), by=.EACHI]$V1
]


    r      Dates SiteID Value Exceedances v
 1: 1 2015-04-01      A   9.1           1 1
 2: 2 2015-04-02      A   8.8           1 1
 3: 3 2015-04-02      A   7.9           1 1
 4: 4 2015-04-03      A   9.2           2 2
 5: 5 2015-04-08      A   9.3           1 1
 6: 6 2015-04-11      A   8.9           0 0
 7: 7 2015-04-11      A   9.2           1 1
 8: 8 2015-04-13      A   9.1           2 2
 9: 9 2015-04-16      A   7.8           0 0
10: 1 2015-04-01      B   9.1           1 1
11: 2 2015-04-01      B   8.8           1 1
12: 3 2015-04-04      B   9.9           1 1
13: 4 2015-04-05      B   7.8           1 1
14: 5 2015-04-06      B   9.8           2 2
15: 6 2015-04-06      B   9.2           3 3
16: 7 2015-04-07      B   9.1           3 3
17: 8 2015-04-08      B   8.5           3 3
18: 9 2015-04-15      B   9.1           1 1

There are some potential problems here:

  • You're counting the same day more than once when it has several exceedances; if you only want the number of days, use uniqueN(x.Dates[Value > 9]) instead of sum(Value > 9).
  • I suspect there's not a good reason to care about row order here. To drop that part, just exclude the parts about r and rtop (see the sketch below).
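
Putting both bullets together, a minimal sketch of the simplified query (untested; it drops r/rtop and counts distinct days rather than rows):

DT[, v := 
  DT[.(SiteID = SiteID, d0 = Dates - 3, d1 = Dates), 
    on=.(SiteID, Dates > d0, Dates <= d1), 
    uniqueN(x.Dates[Value > 9]), by=.EACHI]$V1
]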

Regarding how it works, maybe review the vignettes and my answer to a similar question here.

Frank
  • heh, i was also trying to figure out whether the order indeed matters :) – alexis_laz Feb 02 '17 at 20:09
  • @Frank, after some discussion with my co-workers we have decided that the date order (row order) is not a concern for us. Your suspicions were correct :) – Mark H Feb 03 '17 at 14:43

We can use sqldf to formulate the problem as a complex left join:

library(sqldf)

sqldf("select a.*, sum(b.Value > 9) exceed
       from DT a
            left join DT b on a.SiteID = b.SITEID and 
                              b.Dates > a.Dates - 3 and
                              b.rowid <= a.rowid
       group by a.rowid")

giving:

        Dates SiteID Value exceed
1  2015-04-01      A   9.1      1
2  2015-04-02      A   8.8      1
3  2015-04-02      A   7.9      1
4  2015-04-03      A   9.2      2
5  2015-04-08      A   9.3      1
6  2015-04-11      A   8.9      0
7  2015-04-11      A   9.2      1
8  2015-04-13      A   9.1      2
9  2015-04-16      A   7.8      0
10 2015-04-01      B   9.1      1
11 2015-04-01      B   8.8      1
12 2015-04-04      B   9.9      1
13 2015-04-05      B   7.8      1
14 2015-04-06      B   9.8      2
15 2015-04-06      B   9.2      3
16 2015-04-07      B   9.1      3
17 2015-04-08      B   8.5      3
18 2015-04-15      B   9.1      1
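
For reference, rowid is SQLite's built-in row identifier and, for a data frame loaded by sqldf, it follows the original row order, so b.rowid <= a.rowid restricts the join to the current and earlier rows. If that per-row ordering can be ignored (as the comments below suggest), an untested sketch of the same query keyed on dates alone would be:

sqldf("select a.*, sum(b.Value > 9) exceed
       from DT a
            left join DT b on a.SiteID = b.SiteID and
                              b.Dates >  a.Dates - 3 and
                              b.Dates <= a.Dates
       group by a.rowid")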
G. Grothendieck
  • Fyi, your result doesn't look like the OP's. – Frank Feb 02 '17 at 18:42
  • Thanks for the quick answer but as others have pointed out this solution does not give the desired results. This is performing the count based on the previous 3 entries rather than the previous days based on the date. This does not work because there are multiple entries for some days and also gaps where days are missing. – Mark H Feb 02 '17 at 19:07
  • @Mark Please review and/or explain your expected output more carefully. I think there are a few further inconsistencies, like why do rows 14 and 15 have different values, even though they pertain to the same date? Ditto for rows 6 & 7. – Frank Feb 02 '17 at 19:10
  • @Frank Rows 6/7 were not what I intended. I have corrected them to 0 and 1, respectively. Your question still remains though. The expected output contained different values for the same date because I expected them to be increased as additional lines were encountered. I now see that some solutions may not analyze the data in this way and instead look for all the days together before providing the count. – Mark H Feb 02 '17 at 19:47
  • Have revised answer. – G. Grothendieck Feb 02 '17 at 22:46

Here is an answer using data.table. Simple, and probably quick. It uses shift to get the previous two rows' Value (filling with zeros for the first rows of each group), gives a 1 for > 9 and a 0 otherwise, then adds them up (including the 1 or 0 for the current row).

library(data.table)
setDT(DT)
DT[, newCol := ifelse(shift(Value, n=1, fill=0) > 9, 1, 0) +
               ifelse(shift(Value, n=2, fill=0) > 9, 1, 0) +
               ifelse(Value > 9, 1, 0), by=SiteID]

Per Frank's comment below, this also works:

DT[, newCol := (shift(Value, n=1, fill=0) > 9) + (shift(Value, n=2, fill=0) > 9) + (Value > 9), by=SiteID]
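
Frank also notes in the comments below that shift accepts a vector of lags, so the three terms can be collapsed with Reduce; a minimal sketch of that variant (assuming leading rows in each group should be filled with FALSE):

DT[, newCol := Reduce(`+`, shift(Value > 9, n=0:2, fill=FALSE)), by=SiteID]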

moman822
  • You don't need ifelse, just add logicals directly and they'll be treated as 1/0. Try `(3 > 4) + 2`. Also, `shift` can take a vector of values, like `n=0:2` – Frank Feb 02 '17 at 18:58
  • I don't think using a vector in `shift` will work because the T/F values need to be added and can't use sum, which provides a sum of the entire column. – moman822 Feb 02 '17 at 19:10
  • You can use Reduce(`+`, shift(...)) maybe. – Frank Feb 02 '17 at 19:11

Accounting for the fact that the order of the "Dates" column matters, a way seems to be:

thres = 9; n = 3       
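# per site: cs is the running count of exceedances; findInterval returns, for
# each row, the index of the last row dated before the n-day window, so
# cs - cs[i] is the count within the window (with 0 used where i is 0)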
do.call(rbind, lapply(split(DT, DT$SiteID),
                      function(d) {
                          cs = cumsum(d$Value >= thres)
                          i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)
                          cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))
                      }))
#     r      Dates SiteID Value Exceedances exceed
#A.1  1 2015-04-01      A   9.1           1      1
#A.2  2 2015-04-02      A   8.8           1      1
#A.3  3 2015-04-02      A   7.9           1      1
#A.4  4 2015-04-03      A   9.2           2      2
#A.5  5 2015-04-08      A   9.3           1      1
#A.6  6 2015-04-11      A   8.9           0      0
#A.7  7 2015-04-11      A   9.2           1      1
#A.8  8 2015-04-13      A   9.1           2      2
#A.9  9 2015-04-16      A   7.8           0      0
#B.10 1 2015-04-01      B   9.1           1      1
#B.11 2 2015-04-01      B   8.8           1      1
#B.12 3 2015-04-04      B   9.9           1      1
#B.13 4 2015-04-05      B   7.8           1      1
#B.14 5 2015-04-06      B   9.8           2      2
#B.15 6 2015-04-06      B   9.2           3      3
#B.16 7 2015-04-07      B   9.1           3      3
#B.17 8 2015-04-08      B   8.5           3      3
#B.18 9 2015-04-15      B   9.1           1      1
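
Since the question mentions a rolling year and NAs in Value, a minimal adaptation (an untested sketch: only the window length and the cumulative-sum condition change, so that NA values are never counted as exceedances) would be:

thres = 9; n = 365
do.call(rbind, lapply(split(DT, DT$SiteID),
                      function(d) {
                          cs = cumsum(!is.na(d$Value) & d$Value >= thres)
                          i = findInterval(d$Dates - (n - 1), d$Dates, left.open = TRUE)
                          cbind(d, exceed = cs - c(rep_len(0, sum(!i)), cs[i]))
                      }))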
alexis_laz