How do I prevent {data.table}foverlaps from feeding NA's into its any(...) call when executing on large datatables?

Question

First of all, a similar problem:

Foverlaps error: Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop

The story

I'm trying to count how many times fluor emissions (measured every minute) overlap with a given event. An emission is said to overlap with an event when the emission time falls within the window from 10 minutes before to 30 minutes after the time of the event. In total we consider three events: AC, CO and MT.

The data

Edit 1:

Here are two example datasets that allow the execution of the code below. The code runs just fine for these sets. Once I have data that generates the error I'll make a second edit. Note that event.GN in the example dataset below is a data.table instead of a list

library(data.table)
library(lubridate)  # for ymd_hms()
emissions.GN <- data.table(date.time = seq(ymd_hms("2016-01-01 00:00:00"), by = "min", length.out = 1000000))
event.GN <- data.table(dat = seq(ymd_hms("2016-01-01 00:00:00"), by = "15 mins", length.out = 26383))

Edit 2: I created a csv file containing the data event.GN that generates the error. The file has 26383 rows of one variable dat but only about 14000 are necessary to generate the error.

Edit 3: Up to and including the dat "2017-03-26 00:25:20", the function works fine. Right after adding the next record, with dat "2017-03-26 01:33:46", the error occurs. I noticed that there are more than 60 minutes between those two points. This means that one or several emission records between those two event times won't have corresponding events. This in turn will generate NAs that somehow get caught up in the any() call of the foverlaps function. Am I looking in the right direction?

The fluor emissions are stored in a large datatable (~1 million rows) called emissions.GN. Note that only the date.time (POSIXct) variable is relevant to my problem.

example of emissions.GN:

         date.time     fluor hall                  period        dt
 1: 2016-01-01 00:17:04 0.3044254   GN [2016-01-01,2016-02-21] -16.07373
 2: 2016-01-01 00:17:04 0.4368381   GN [2016-01-01,2016-02-21] -16.07373
 3: 2016-01-01 00:18:04 0.5655382   GN [2016-01-01,2016-02-21] -16.07395
 4: 2016-01-01 00:19:04 0.6542259   GN [2016-01-01,2016-02-21] -16.07417
 5: 2016-01-01 00:21:04 0.6579384   GN [2016-01-01,2016-02-21] -16.07462

The data of the three events is stored in three smaller datatables (~20 thousand records) contained in a list called events.GN. Note that only the dat (POSIXct) variable is relevant to my problem.

example of AC events (CO and MT are analogous):

events.GN[["AC"]]

              dat hall numevt                                              txtevt
1: 2016-01-01 00:04:54   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
2: 2016-01-01 00:09:21   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
3: 2016-01-01 00:38:53   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
4: 2016-01-01 02:30:33   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)
5: 2016-01-01 02:34:11   GN    321     PHASE 1 CHANGEMENT D'ANODE (Position anode @1I)

The function

I have written a function that applies foverlaps to a given (large) x datatable and a given (small) y datatable. The function returns a datatable with two columns. The first column, yid, contains the indices of the emissions.GN observations that overlap at least once with an event. The second column, N, contains the overlap count (i.e. the number of times an overlap occurs for that particular index). Indices of emissions with zero overlaps are omitted from the result.

# Requires data.table (foverlaps, setkey) and lubridate (minutes); %>% comes from magrittr.
library(magrittr)
# Count the number of times an emission record falls between the start and end point of an event window.
find_index_and_count <- function(hall, event, lower.margin = 10, upper.margin = 30) {
  # Give each emission record a zero-width interval: start and stop are both the measurement time.
  hall$start <- hall$date.time
  hall$stop <- hall$date.time
  # Define each event's interval from the margins: by default 10 minutes before to 30 minutes after.
  event$start <- event$dat - minutes(lower.margin)
  event$stop <- event$dat + minutes(upper.margin)
  # Set the key of both datasets to start and stop.
  setkey(hall, start, stop)
  setkey(event, start, stop)
  # Return the index of each emission record (yid) with the number of event intervals it falls in (N).
  # The call to na.omit removes NAs introduced by x records that don't overlap any y interval.
  foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit
}

The function executes successfully for the events AC and CO

The function gives the desired result as described above when called on the events AC and CO:

find_index_and_count(emissions.GN,events.GN[["AC"]])

   yid N
 1:       1 1
 2:       2 1
 3:       3 1
 4:       4 1
 5:       5 2
---

find_index_and_count(emissions.GN,events.GN[["CO"]])

yid N
 1:       3 1
 2:       4 1
 3:       5 1
 4:       6 1
 5:       7 1
---

The function returns an error when called on the MT event

The following function call results in the error below:

find_index_and_count(emissions.GN,events.GN[["MT"]])

Error in if (any(x[[xintervals[2L]]] - x[[xintervals[1L]]] < 0L)) stop("All entries in column ", : missing value where TRUE/FALSE needed

5.foverlaps(event, hall, nomatch = NA, which = TRUE)

4.eval(lhs, parent, parent)

3.eval(lhs, parent, parent)

2.foverlaps(event, hall, nomatch = NA, which = TRUE)[, .N, by = yid] %>% na.omit

1.find_index_and_count(emissions.GN, events.GN[["MT"]])

I assume the function returns an NA whenever a record in x (emissions.GN) has no overlap with any of the events in y (events.GN[["AC"]] etc.).
I don't understand why the function fails on the event MT when it works just fine for AC and CO. The data have exactly the same structure; only the values and the number of records differ slightly.

What I have tried so far

Firstly, in the similar problem linked above, someone pointed out the following idea:

This often indicates an NA value being fed to the any function, so it returns NA and that's not a legal logical value. – Carl Witthoft May 7 '15 at 13:50

Hence, I modified the call to foverlaps to use nomatch = 0 instead of nomatch = NA, so that rows without an overlap between x and y are dropped rather than returned as NA, like this:

foverlaps(event,hall,nomatch = 0, which = TRUE)[, .N, by=yid] %>% na.omit

This did not change anything (the function works for AC and CO but fails for MT).

Secondly, I made absolutely sure that none of my datatables contained NA's.
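
A check on the raw columns does not cover the start and stop columns that the function derives inside its body. The sketch below checks those derived bounds for the event time from Edit 3. It rests on an assumption that is not stated anywhere in the question: that the timestamps are stored in a time zone that observes daylight saving time (Europe/Paris is used here purely for illustration). In such a zone, lubridate period arithmetic such as minutes(30) works on clock time and can return NA when the result lands on a local time that does not exist, whereas duration arithmetic (dminutes) or plain seconds works on elapsed time.

```r
library(lubridate)

# Hypothetical setup: the time zone is an assumption, not taken from the question.
dat <- ymd_hms("2017-03-26 01:33:46", tz = "Europe/Paris")

# Period arithmetic moves the clock time. On this date clocks in Europe/Paris
# jumped from 02:00 to 03:00, so 02:03:46 is not a valid local time and the
# result may come back as NA.
stop_period <- dat + minutes(30)

# Duration arithmetic and plain seconds both add 1800 elapsed seconds instead.
stop_duration <- dat + dminutes(30)
stop_seconds <- dat + 30 * 60

# Check the derived bound, not only the raw column.
anyNA(stop_period)
anyNA(stop_duration)
```

If the first check reports an NA on the real data, an NA in the stop column of x would be enough to make the any(...) condition inside foverlaps evaluate to NA, regardless of the nomatch setting.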

More information

If required, I can provide the SQL code that generates the emissions.GN data and all the events.GN data. Note that because all the events.GN data has the same origin, there should be no differences (other than the values) between the data of the events AC, CO and MT.
If anything else is required, please do feel free to ask!

Typically a good example on this site would be reproducible (so we can copy paste code into a fresh R console and see the problem) and minimal (without extraneous info). Up to you if you want to edit in that direction (and I'm not saying that I personally could solve it if you did), but anyway some guidance here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 Note that we don't necessarily need/want your actual data, just some data and code that illustrates the problem. — Frank, Apr 27 '18 at 14:20
Hey, thank you for your input. It's hard for me to reproduce the error with different data but I'm working on it. I'll add what I have so far. — , Apr 27 '18 at 14:57

score 0 · Accepted Answer · answered Apr 27 '18 at 16:24

I'm trying to count how many times fluor emissions (measured every 1 minute) overlap with a given event. An emission is said to overlap with a given event when the emission time is 10 minutes before or 30 minutes after the time of the event.

Just addressing this objective (since I don't know foverlaps well.)...

event.GN[, n := 
  emissions.GN[.SD[, .(d_dn = dat - 10*60, d_up = dat + 30*60)], on=.(date.time >= d_dn, date.time <= d_up), 
    .N
  , by=.EACHI]$N
]

                       dat  n
    1: 2016-01-01 00:00:00 31
    2: 2016-01-01 00:15:00 41
    3: 2016-01-01 00:30:00 41
    4: 2016-01-01 00:45:00 41
    5: 2016-01-01 01:00:00 41
   ---                       
26379: 2016-10-01 18:30:00 41
26380: 2016-10-01 18:45:00 41
26381: 2016-10-01 19:00:00 41
26382: 2016-10-01 19:15:00 41
26383: 2016-10-01 19:30:00 41
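
For readers who want to see the same non-equi join on something small enough to verify by hand, here is a minimal sketch. The table names emissions and events and all values in them are made up for illustration. The window bounds are computed in plain seconds, as above, and by = .EACHI yields one count per event row.

```r
library(data.table)

# Toy data (hypothetical values): one emission per minute for two hours.
emissions <- data.table(
  date.time = seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
                  by = "min", length.out = 120)
)
# Two hypothetical events.
events <- data.table(
  dat = as.POSIXct(c("2016-01-01 00:05:00", "2016-01-01 01:00:00"), tz = "UTC")
)

# Window bounds in plain seconds: 10 minutes before to 30 minutes after.
events[, `:=`(d_dn = dat - 10 * 60, d_up = dat + 30 * 60)]

# Non-equi join: for each event row (by = .EACHI), count the emissions whose
# date.time lies inside [d_dn, d_up].
counts <- emissions[events,
                    on = .(date.time >= d_dn, date.time <= d_up),
                    .N, by = .EACHI]
events[, n := counts$N]
events[, .(dat, n)]
```

The first event's window starts before the first emission, so it covers only the 36 emissions from 00:00 to 00:35; the second event's window is fully covered and holds 41 emissions, matching the counts of 41 in the output above.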

To check/verify one of these counts...

> # dat from 99th event...
> my_d <- event.GN[99, {print(.SD); dat}]
                   dat  n
1: 2016-01-02 00:30:00 41
> 
> # subsetting to overlapping emissions
> emissions.GN[date.time %between% (my_d + c(-10*60, 30*60))]
              date.time
 1: 2016-01-02 00:20:00
 2: 2016-01-02 00:21:00
 3: 2016-01-02 00:22:00
 4: 2016-01-02 00:23:00
 5: 2016-01-02 00:24:00
 6: 2016-01-02 00:25:00
 7: 2016-01-02 00:26:00
 8: 2016-01-02 00:27:00
 9: 2016-01-02 00:28:00
10: 2016-01-02 00:29:00
11: 2016-01-02 00:30:00
12: 2016-01-02 00:31:00
13: 2016-01-02 00:32:00
14: 2016-01-02 00:33:00
15: 2016-01-02 00:34:00
16: 2016-01-02 00:35:00
17: 2016-01-02 00:36:00
18: 2016-01-02 00:37:00
19: 2016-01-02 00:38:00
20: 2016-01-02 00:39:00
21: 2016-01-02 00:40:00
22: 2016-01-02 00:41:00
23: 2016-01-02 00:42:00
24: 2016-01-02 00:43:00
25: 2016-01-02 00:44:00
26: 2016-01-02 00:45:00
27: 2016-01-02 00:46:00
28: 2016-01-02 00:47:00
29: 2016-01-02 00:48:00
30: 2016-01-02 00:49:00
31: 2016-01-02 00:50:00
32: 2016-01-02 00:51:00
33: 2016-01-02 00:52:00
34: 2016-01-02 00:53:00
35: 2016-01-02 00:54:00
36: 2016-01-02 00:55:00
37: 2016-01-02 00:56:00
38: 2016-01-02 00:57:00
39: 2016-01-02 00:58:00
40: 2016-01-02 00:59:00
41: 2016-01-02 01:00:00
              date.time
