
Let's say I have a list of non-overlapping genomic intervals:

chr1    1   100
chr1    101 200
chr1    201 300
chr1    301 400

and a list of genomic positions, each linked to a sample:

chr1    50  sampleA
chr1    60  sampleB
chr1    110 sampleA
chr1    130 sampleB
chr1    160 sampleA
chr1    190 sampleC
chr1    350 sampleB
chr1    360 sampleB

My goal is to count the number of unique samples for each interval. In my real dataset the interval table has ~400,000 rows and the position-sample table has ~30,000 rows.
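
For the toy data above, the expected output, including the interval that contains no positions, would be:

chr1    1   100 2
chr1    101 200 3
chr1    201 300 0
chr1    301 400 1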

This calculation is embedded in a simulation, so it should be as fast as possible. I already tried GenomicRanges:

library(GenomicRanges)
library(dplyr)

interval.gr <- GRanges(intervals$chr, IRanges(intervals$start, intervals$end))
positions.gr <- GRanges(positions$chr, IRanges(positions$pos, positions$pos))  # width-1 ranges

ov <- findOverlaps(interval.gr, positions.gr)  # which positions fall in which intervals

intervals %>%
  slice(queryHits(ov)) %>%                                # one row per overlap
  mutate(sample = positions$sample[subjectHits(ov)]) %>%  # attach the matching sample
  group_by(chr, start, end) %>%
  summarise(n_sample = length(unique(sample)))

results in

# A tibble: 3 x 4
# Groups:   chr, start [3]
  chr   start   end n_sample
  <fct> <dbl> <dbl>    <int>
1 chr1      1   100        2
2 chr1    101   200        3
3 chr1    301   400        1

However, it drops intervals without any samples in them (201-300), and it is also not very fast. Using my real dataset:

Unit: milliseconds
 expr      min      lq     mean   median       uq      max neval
    x 159.3901 161.621 190.1703 164.4879 168.3116 297.8395    10

I'm wondering if there are better and faster ways to do this kind of analysis.

Thanks


Reproducible datasets:

intervals <- data.frame(chr = c("chr1", "chr1", "chr1", "chr1"),
                        start = c(1, 101, 201, 301),
                        end = c(100, 200, 300, 400))

positions <- data.frame(chr = rep("chr1", 8),
                        pos = c(50, 60, 110, 130, 160, 190, 350, 360),
                        sample = c("sampleA", "sampleB", "sampleA", "sampleB",
                                   "sampleA", "sampleC", "sampleB", "sampleB"))

Edit:

Reproducible datasets with the same size as my real dataset:

set.seed(14)  # seed added (matching the answer below) so these datasets really are reproducible
intervals <- data.frame(chr = paste0("chr", round(runif(400000, min = 1, max = 22))),
                        start = round(runif(n = 400000, min = 1, max = 100000000)))
intervals$end <- intervals$start + 100

positions <- data.frame(chr = paste0("chr", round(runif(30000, min = 1, max = 22))),
                        pos = round(runif(n = 30000, min = 1, max = 100000000)),
                        sample = sample(paste0("sample", 1:400), size = 30000, replace = TRUE))
Nicolas Rosewick

2 Answers


Building on what @Jon said, data.table is a great way to approach this. Using the function foverlaps() greatly improves speed.

library(data.table)
intervals <- data.frame(chr=c("chr1","chr1","chr1","chr1"),
                        start=c(1,101,201,301),
                        end=c(100,200,300,400))

positions <- data.frame(chr=rep("chr1",8),
                        pos=c(50,60,110,130,160,190,350,360),
                        sample=c("sampleA","sampleB","sampleA","sampleB","sampleA","sampleC","sampleB","sampleB"))
setDT(positions)
setDT(intervals)

positions[, pos_tmp := pos]  # foverlaps needs both a start and an end column
setkey(positions, chr, pos, pos_tmp)
overlap <- foverlaps(intervals, positions, type = "any",
                     by.x = c("chr", "start", "end"))  # overlap join of intervals and positions
overlap[!is.na(sample), .(n_sample = .N), by = .(chr, start, end)]

Compared to @Jon's implementation, which takes ~6 seconds on my machine, the above implementation takes ~180 milliseconds.

Bryan
  • Hi @Bryan. It's indeed very fast, thanks! However, how can I keep the intervals with 0 samples? – Nicolas Rosewick Sep 25 '19 at 13:23
  • @Bryan the foverlaps approach is really good! n_sample = .N can generate issues if there are sample duplicates; you should use length(unique()). – Jon Nagra Sep 25 '19 at 14:39
  • @Jon good point: the above would count a sample once for each pos value it has. If you want a count of distinct samples, data.table has a nice built-in for length(unique()) called uniqueN(). Here's an example of it, with empty groups kept: `overlap[, .(n_sample = uniqueN(sample[!is.na(sample)])), by = .(chr, start, end)]` – Bryan Sep 25 '19 at 15:40
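
Putting the pieces from these comments together, a complete version of the foverlaps approach that counts distinct samples and keeps the empty intervals could look like the sketch below (assembled from Bryan's comment, not benchmarked separately):

library(data.table)
setDT(intervals)
setDT(positions)

positions[, pos_tmp := pos]  # foverlaps needs both a start and an end column
setkey(positions, chr, pos, pos_tmp)

# intervals with no overlapping position are kept, with NA in the sample column
overlap <- foverlaps(intervals, positions, type = "any",
                     by.x = c("chr", "start", "end"))

# uniqueN() over the non-NA samples yields 0 for the empty intervals
overlap[, .(n_sample = uniqueN(sample[!is.na(sample)])), by = .(chr, start, end)]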

For a large dataset I'd try a non-equi join in data.table. The benchmark on this small sample is worse than tmfmnk's answer, but for a bigger dataset it will likely have a smaller memory footprint and be faster.

library(data.table)
library(microbenchmark)

intervals <-
  data.table(
    chr = c("chr1", "chr1", "chr1", "chr1"),
    start = c(1, 101, 201, 301),
    end = c(100, 200, 300, 400)
  )
positions <-
  data.table(
    chr = rep("chr1", 8),
    pos = c(50, 60, 110, 130, 160, 190, 350, 360),
    sample = c(
      "sampleA",
      "sampleB",
      "sampleA",
      "sampleB",
      "sampleA",
      "sampleC",
      "sampleB",
      "sampleB"
    )
  )


# Non-equi join keeping only the columns we need:
positions[intervals, on = .(chr == chr, pos <= end, pos >= start), .(
  chr = i.chr,
  start = i.start,
  end = i.end,
  sample = x.sample
)][, .(n_sample = uniqueN(sample, na.rm = TRUE)), by = c("chr", "start", "end")]


# Benchmarking
microbenchmark(positions[intervals, on = .(chr == chr, pos <= end, pos >= start), .(
  chr = i.chr,
  start = i.start,
  end = i.end,
  sample = x.sample
)][, .(n_sample = uniqueN(sample, na.rm = TRUE)), by = c("chr", "start", "end")]
, times = 50)

Let me know if this does the job. I included a few changes to create an example closer to your use case, and I also included Bryan's approach in the benchmarks. Your approach is still the fastest. I had hopes for the 4th approach (only 2 keys), but it is still slower.

library(data.table)
library(microbenchmark)
library(GenomicRanges)
library(dplyr)  # needed for the pipeline in the GenomicRanges benchmark
set.seed(14)
intervals <- data.table(
    chr = paste0("chr", round(runif(400000, min = 1, max = 22))),
    start = round(runif(n = 400000, min = 1, max = 100000000))
)
# modified to start always with 1
intervals[, start := floor(start / 100) * 100 + 1L]

intervals$end <- intervals$start + 100L

# erase duplicates
intervals <-
    unique(intervals[, .SD, .SDcols = c("chr", "start", "end")])

positions <- data.table(
    chr = paste0("chr", round(runif(30000, min = 1, max = 22))),
    pos = round(runif(n = 30000, min = 1, max = 100000000)),
    sample = sample(paste0("sample", 1:400), size = 30000, replace = TRUE)
)

setkeyv(intervals, c("chr", "start", "end"))
positions[, pos2 := pos]
setkeyv(positions, c("chr", "pos", "pos2"))


microbenchmark({
    interval.gr <-
        GRanges(intervals$chr, IRanges(intervals$start, intervals$end))
    positions.gr <-
        GRanges(positions$chr, IRanges(positions$pos, positions$pos))
    ov <- findOverlaps(interval.gr, positions.gr)
    res1 <- intervals %>%
        slice(queryHits(ov)) %>%
        mutate(sample = positions$sample[subjectHits(ov)]) %>%
        group_by(chr, start, end) %>%
        summarise(n_sample = length(unique(sample))) %>% data.table(.)

    intervals_1 <-
        rbind(res1, intervals, fill = TRUE)[, sum(n_sample, na.rm = TRUE), by = c("chr", "start", "end")]
}, times = 50)
#> Unit: milliseconds
#>       min       lq     mean   median       uq      max neval
#>  148.0612 164.4884 225.8665 178.5495 219.6038 585.5995    50


microbenchmark({
    # This takes only the ones we need:
    res2 <-
        positions[intervals, on = .(chr == chr, pos <= end, pos >= start), nomatch = 0L, .(
            chr = i.chr,
            start = i.start,
            end = i.end,
            sample = x.sample
        )][, .(n_sample = length(unique(sample))), by = c("chr", "start", "end")]
    intervals_2 <-
        rbind(res2, intervals, fill = TRUE)[, sum(n_sample, na.rm = TRUE), by = c("chr", "start", "end")]

}, times = 50)
#> Unit: milliseconds
#>       min       lq     mean   median       uq      max neval
#>  294.4081 338.0309 447.0925 383.2813 548.3734 883.4389    50

microbenchmark({
    # Bryan's approach
    overlap = foverlaps(intervals,
                        positions,
                        type = "any",
                        by.x = c("chr", "start", "end")) ## return overlap indices
    res3 <-
        overlap[!is.na(sample), .(n_sample = length(unique(sample))), by = .(chr, start, end)]
    intervals_3 <-
        rbind(res3, intervals, fill = TRUE)[, sum(n_sample, na.rm = TRUE), by = c("chr", "start", "end")]

}, times = 50)
#> Unit: milliseconds
#>      min       lq    mean   median       uq      max neval
#>  221.907 280.8795 429.961 398.0075 475.0276 1017.866    50


## Approach 4: equi join on interval starts binned to a 1-based 100 bp grid (only 2 keys)

set.seed(14)
intervals <- data.table(
    chr = paste0("chr", round(runif(400000, min = 1, max = 22))),
    start = round(runif(n = 400000, min = 1, max = 100000000))
)
# modified to start always with 1
intervals[, start := floor(start / 100) * 100 + 1L]


positions[, start := floor(pos / 100) * 100 + 1L]
positions <-
    unique(positions[, .SD, .SDcols = c("chr", "start", "sample")])
setkeyv(intervals, c("chr", "start"))
setkeyv(positions, c("chr", "start"))
microbenchmark({
    intervals_4 <-
        positions[intervals][, .(n_sample = sum(!is.na(sample))), by = c("chr", "start")]#
    # add end
    intervals_4[, end := start + 100L]
}, times = 50)
#> Unit: milliseconds
#>       min       lq     mean   median       uq      max neval
#>  419.1001 443.9567 546.0186 478.5985 650.4682 1115.179    50

Created on 2019-09-25 by the reprex package (v0.3.0)

Jon Nagra
  • Thank you @Jon. I tried your solution; it's indeed faster on the test dataset but not on the real one (my solution is still the fastest). I will edit my question to add a simulated dataset with the same size as the real dataset, to make benchmarking easier. – Nicolas Rosewick Sep 25 '19 at 11:54
  • 1
    or simpler: `positons[, c("start", "end") := .(pos, pos)][intervals, on=.(chr, start>=start, end<=end), .N, by=.EACHI, nomatch=0L]` should work I think ... – Arun Sep 26 '19 at 18:02
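
Expanding Arun's one-liner into a runnable sketch (untested here) that counts distinct samples rather than positions and keeps the empty intervals:

library(data.table)
setDT(intervals)
setDT(positions)

# give each position a zero-width interval so the non-equi join can match it
positions[, c("start", "end") := .(pos, pos)]

# by = .EACHI evaluates j once per interval row; dropping nomatch = 0L keeps
# intervals with no positions (sample is NA there, so uniqueN() returns 0)
positions[intervals,
          on = .(chr, start >= start, end <= end),
          .(n_sample = uniqueN(sample[!is.na(sample)])),
          by = .EACHI]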