identify consecutively overlapping segments in R

Question

I need to aggregate overlapping segments into a single segment ranging all connected segments.

Note that a simple foverlaps cannot detect connections between non overlapping but connected segments, see the example for clarification. If it would rain on my segments in my plot I am looking for the stretches of dry ground.

So far I solve this problem by an iterative algorithm but I'm wondering if there is a more elegant and stright forward way for this problem. I'm sure not the first one to face it.

I was thinking about a non-equi rolling join, but faild to implement that

library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59), 
                  end = c(42,44,45,53,48,50,52,55,57,56,58,60)))

#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60

setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))

x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), k=grp])

#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60

do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))

EDIT:

That's brilliant, thanks, cummax & cumsum do the job, Uwe's Answer is slightly better than Davids comment.

end[.N] can get wrong results, try example data x below. max(end) is correct in all cases, and faster.

x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))
min(start) and start[1L] give the same (as x is ordered by start), the latter is faster.
grp on the fly is significantly faster, unfortunately I need grp assigned.
cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
I did not test the package GenomicRanges solution.

`x[, .(start[1L], end[.N]), by = .(grp = cumsum(c(0, start[-1L] > cummax(head(end, -1L)))))]` could work. Basically a data.table version of my solution [here](https://stackoverflow.com/a/28938694/3001626) — David Arenburg, Aug 30 '18 at 08:46

Uwe · Accepted Answer · 2018-08-30T09:28:40.843

6

The OP has requested to aggregate overlapping segments into a single segment ranging all connected segments.

Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:

x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
  , .(start = min(start), end = max(end)), by = grp]

   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60

Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.

Edit:

As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:

x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]

Visualisation

OP's plot can be amended to show also the aggregated segments (quick and dirty):

plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
    , segments(start, grp + 0.5, end, grp + 0.5, "red", , 4)]

edited Aug 30 '18 at 09:28

answered Aug 30 '18 at 08:49

Uwe

41,420
11
90
134

You don't need to create `grp`, you can pass it directly into `by`. – David Arenburg Aug 30 '18 at 09:08
@DavidArenburg there is no `grp` provided, and that is the main aim of the question. Creating the group – Onyambu Aug 30 '18 at 09:10
You mean OP wants the `grp` in their original data? Perhaps you are right – David Arenburg Aug 30 '18 at 09:11
@Onyambu The OP has requested *to aggregate overlapping segments into a single segment ranging all connected segments.* – Uwe Aug 30 '18 at 09:16
I get it.. Although just saw OP wrote *the grouping I am looking for*. its still answers the solution – Onyambu Aug 30 '18 at 09:22

Roman · Answer 2 · 2018-08-30T09:55:56.807

5

You can try a GenomicRanges approach. In the output each row is a group.

library(GenomicRanges)
x_gr <- with(x, GRanges(1, IRanges(start, end)))
as.data.table(reduce(x_gr, min.gapwidth=0))[,2:3]
   start end
1:    41  42
2:    43  45
3:    47  58
4:    59  60

And a visual insepection can be done using Gviz. Here one has to know that the package has been built for biologists and genetic information. The pattern behind are DNA bases. Hence, one has to substract 1 of the segment ends to get the correct plot.

library(Gviz)
ga <- Gviz::GenomeAxisTrack()
xgr <- with(x, GRanges(1, IRanges(start, end = end - 1)))
xgr_red <- reduce(xgr, min.gapwidth=1)
ga <- GenomeAxisTrack()
GT <- lapply(xgr, GeneRegionTrack)
GT_red <- lapply(xgr_red, GeneRegionTrack, fill = "lightblue")
plotTracks(c(ga, GT, GT_red),from = min(x$start), to = max(x$start)+2)

edited Aug 30 '18 at 09:55

answered Aug 30 '18 at 08:32

Roman

17,008
3
36
49

1

The group is what exactly he is looking for. – Onyambu Aug 30 '18 at 08:46
in your edit you are still using `x$grp=.....` This is a manual input of the groups. It should not be the case – Onyambu Aug 30 '18 at 09:09
@Jimbou your very first statement cleary states `#add the grouping` But we do not have the grouping.. we are supposed to create/obtain the grouping. see the answer provided by Uwe – Onyambu Aug 30 '18 at 09:12
@Onyambu I deleted my first answer. Now you can only pass `start` and `end`, then you get the expected result. Happy with that? – Roman Aug 30 '18 at 09:17
This tackles the question. Love it – Onyambu Aug 30 '18 at 09:24
@Jimbou Great! Thank you for updating the chart. I will delete my comments as they are no longer needed. – Uwe Aug 30 '18 at 10:00

identify consecutively overlapping segments in R

2 Answers2

Visualisation

Linked

Related