1

In GenomicRanges one interesting problem is the identification of gene islands.

I am trying to find the largest subset of ranges in which neighboring ranges dont exceed a certain distance. To solve the issue I have tried to assign groups based on the difference between individual ranges.

I searched within IRanges package for suitable methods but I was not successful so far.

daf <- data.frame(seqnames="ConA",start=c(10,39,56,1000,5000),end=c(19,44,87,1100,5050),ID=paste0("Range",LETTERS[1:5]))
gr <- makeGRangesFromDataFrame(daf,keep.extra.columns=TRUE)
## Order the data frame by start column
oo <- order(daf$start)
daf <- daf[oo,]

# Calculate the distance
dd <- c(0,diff(daf$start))
daf$diff <- dd
daf$group <- rep(1,nrow(daf))

group <- 1
for(i in 1:nrow(daf)){
  if(daf$diff[i] > 500){
    group <- group + 1
  }
  daf$group[i] <- group
}

Based on the assigned group one can find the largest one. Do you know any better solution, avoiding the for loop ?

Peter Pisher
  • 457
  • 2
  • 11

1 Answers1

0

I believe this would do what you expect, without the loop:

## Load library
library(GenomicRanges)

## Create the data
gr <- GRanges(seqnames="ConA",
              ranges=IRanges(start=c(10,39,56,1000,5000),
                             end=c(19,44,87,1100,5050),
                             names = paste0("Range",LETTERS[1:5])))

## Get the "start sites" = TSS.
## Use the promoters function if you need to take the strand into account:
grTSS <- promoters(gr, upstream = 0, downstream = 1)

## Create islands in which start sites are located less than 500bp apart
## If distance between start sites is <500bp then ranges of size 501bp centered on start sites overlap
islands <- reduce(grTSS + 250, ignore.strand = TRUE)
names(islands) <- paste0("island", 1:length(islands))

## Get the link between start sites and islands:
ovl <- findOverlaps(grTSS, islands, type = "within", ignore.strand = TRUE)

## Add the group info in the original GRanges object:
gr$group <- names(islands)[subjectHits(ovl)] 

Strand information could be taken into account when creating islands: islands <- reduce(grstarts + 250, ignore.strand = FALSE)
Instead of focusing on start sites / TSS only, we can use gene boundaries (both start and end) directly: islands <- reduce(gr + 250, ignore.strand = TRUE)

Hope this helps

Pascal Martin
  • 311
  • 2
  • 9
  • Thank you for your ideas, but I think there exists a one-line solution, because in my case its only about differences in a single vector – Peter Pisher Mar 05 '20 at 09:44
  • The title of your question sounded more generic, hence my answer that removes the loop and is probably faster on large dataset.You have the one-line solution that you think exist? – Pascal Martin Mar 06 '20 at 16:45
  • We can always make it a one-liner but it will be harder to understand: `gr$group <- (1:length(gr))[subjectHits(findOverlaps(flank(gr, width=-1), reduce(flank(gr, width=-1) + 250)))]` – Pascal Martin Mar 06 '20 at 16:51