How do I get a histogram-like summary of interval data in R?

My MWE data has four intervals.

interval  range
Int1      2-7
Int2      10-14
Int3      12-18
Int4      25-28

I want a histogram-like function that splits a range into fixed-size bins, counts how many of the intervals Int1-Int4 overlap each bin, and reports which ones they are. The output should look like this:

bin     count  which
[0-4]   1      Int1
[5-9]   1      Int1
[10-14] 2      Int2 and Int3
[15-19] 1      Int3
[20-24] 0      None
[25-29] 1      Int4

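Here the range is [minfloor(Int1, Int2, Int3, Int4), maxceil(Int1, Int2, Int3, Int4)) = [0,30), that is, the smallest start rounded down and the largest end rounded up to a multiple of the bin size, and there are six bins of size 5.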
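
The bin construction described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the names `starts`, `ends`, `bin_size`, `lo`, `hi`, and `bins` are not from the original post, and the bins are treated as closed integer ranges such as [0-4].

```r
# Interval end points from the example above (Int1..Int4)
starts <- c(2, 10, 12, 25)
ends   <- c(7, 14, 18, 28)
bin_size <- 5  # assumed fixed bin width

# Round the overall minimum down, and the overall maximum up,
# to the nearest multiple of bin_size: here the range is [0, 30)
lo <- floor(min(starts) / bin_size) * bin_size
hi <- ceiling(max(ends) / bin_size) * bin_size

# Closed integer bins [0-4], [5-9], ..., [25-29]
bin_start <- seq(lo, hi - bin_size, by = bin_size)
bins <- data.frame(start = bin_start, end = bin_start + bin_size - 1)
print(bins)
```

Note that if the largest end point is itself a multiple of `bin_size`, it falls outside the last bin under this half-open definition, so an extra bin may be needed in that case.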

I would greatly appreciate any pointers to R packages or functions that implement the functionality I want.

Update:

So far, I have a partial solution using the IRanges package, which relies on a data structure called NCList that is reportedly faster than interval search trees.

> library(IRanges)
> subject <- IRanges(c(2,10,12,25), c(7,14,18,28))
> query <- IRanges(c(0,5,10,15,20,25), c(4,9,14,19,24,29))
> countOverlaps(query, subject)
[1] 1 1 2 1 0 1

However, I still cannot work out which ranges overlap each bin. I will update if I figure it out.

nandu
  • Have you tried with `table`? – akrun Jun 22 '15 at 11:34
  • I tried table but could not go beyond a naive implementation of querying for every range end point i.e., check [i,j] overlaps [k,l] if j>k and i – nandu Jun 22 '15 at 12:47
  • Is your expected output based on the input example? I would try `cut` with `breaks` – akrun Jun 22 '15 at 12:50
  • How do I get 'cut' to work on intervals? I am under the impression that 'cut' works on vectors. (Sorry about poor formatting.) – nandu Jun 22 '15 at 13:00
  • http://stackoverflow.com/questions/30978837/histogram-like-summary-for-interval-data – user227710 Jun 22 '15 at 13:17

1 Answer

Using IRanges, you should use findOverlaps or mergeByOverlaps instead of countOverlaps. Note that, by default, neither returns an entry for a query range that has no match.

I'll leave that to you. Instead, here is an alternative method using foverlaps() from the data.table package:

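library(data.table)

# Intervals to summarise
subject <- data.table(interval = paste0("int", 1:4),
                      start = c(2, 10, 12, 25),
                      end = c(7, 14, 18, 28))
# Fixed-size bins to count over
query <- data.table(start = c(0, 5, 10, 15, 20, 25),
                    end = c(4, 9, 14, 19, 24, 29))

# foverlaps() needs the second table keyed on its interval columns
setkey(subject, start, end)
# Bins with no overlap are kept, with NA in the subject columns
ans <- foverlaps(query, subject, type = "any")
# One row per bin: number of overlapping intervals and their names
ans[, .(count = sum(!is.na(start)),
        which = paste(interval, collapse = ", ")),
    by = .(i.start, i.end)]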

#    i.start i.end count      which
# 1:       0     4     1       int1
# 2:       5     9     1       int1
# 3:      10    14     2 int2, int3
# 4:      15    19     1       int3
# 5:      20    24     0         NA
# 6:      25    29     1       int4
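
As a package-free cross-check of the table above, here is a minimal base R sketch. It assumes closed intervals, so a bin [k, l] and an interval [i, j] overlap when i <= l and k <= j, which should correspond to `type = "any"` in the call above. The variable names are illustrative only, and `outer()` builds a full bins-by-intervals matrix, so this is only reasonable for small inputs.

```r
# Same example data as above
interval <- paste0("int", 1:4)
s_start <- c(2, 10, 12, 25)
s_end   <- c(7, 14, 18, 28)
q_start <- c(0, 5, 10, 15, 20, 25)
q_end   <- c(4, 9, 14, 19, 24, 29)

# overlap[b, k] is TRUE when bin b and interval k share at least one point
overlap <- outer(q_end, s_start, ">=") & outer(q_start, s_end, "<=")

result <- data.frame(
  start = q_start,
  end   = q_end,
  count = rowSums(overlap),
  # Names of the overlapping intervals, or NA when a bin is empty
  which = apply(overlap, 1, function(hit) {
    if (any(hit)) paste(interval[hit], collapse = ", ") else NA_character_
  })
)
print(result)
```
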
Arun
  • Any comments on the speed of foverlaps vs. IRanges? Does foverlaps use a special data structure or algorithm like NCList? – nandu Jun 23 '15 at 13:18