I know the following problam can be solved using Bioconductor's IRanges-package, using reduce
.
But since that function only accepts numeric input, and I am working with data.table anyway, I am wondering is the following van be achieved using data.tables'foverlaps()
.
Sample data
structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B"
), subgroup = c(1, 1, 2, 2, 1, 1, 2, 2), start = structure(c(1514793600,
1514795400, 1514794200, 1514798100, 1514815200, 1514817000, 1514815800,
1514818800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end = structure(c(1514794500, 1514797200, 1514794800, 1514799000,
1514816100, 1514818800, 1514817600, 1514820600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# group subgroup start end
# 1: A 1 2018-01-01 08:00:00 2018-01-01 08:15:00
# 2: A 1 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2 2018-01-01 08:10:00 2018-01-01 08:20:00
# 4: A 2 2018-01-01 09:15:00 2018-01-01 09:30:00
# 5: B 1 2018-01-01 14:00:00 2018-01-01 14:15:00
# 6: B 1 2018-01-01 14:30:00 2018-01-01 15:00:00
# 7: B 2 2018-01-01 14:10:00 2018-01-01 14:40:00
# 8: B 2 2018-01-01 15:00:00 2018-01-01 15:30:00
Question
What I would like to achieve, is to join/merge events (by group) when:
- a range (start - end) overlaps (or partially overlaps) another range
- the start of a range is the end of another range
Subgroups can be ignored
As mentioned above, I'm know this can be done using biocondustor's IRanges reduce
, but I wonder if the same can be achieved using data.table. I can't shake the feeling that foverlaps
should be able to tackle my problem, but I cannot figure out how...
Since I'm an intermediate R-user, but pretty much a novice in data.table, it's hard for me to 'read' some solutions already provided on stackoverflow. So I'm not sure if a similar quenstion has already been asked and answered (if so, please be gentle ;-) )
Desired output
structure(list(group = c("A", "A", "A", "B"), start = structure(c(1514793600,
1514795400, 1514798100, 1514815200), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), end = structure(c(1514794800, 1514797200,
1514799000, 1514820600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
# group start end
# 1: A 2018-01-01 08:00:00 2018-01-01 08:20:00
# 2: A 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2018-01-01 09:15:00 2018-01-01 09:30:00
# 4: B 2018-01-01 14:00:00 2018-01-01 15:30:00