2

This is I assume a somewhat simple programming issue, but I've been struggling with it. Mostly because I don't know the right words to use, perhaps?

Given a set of "ranges" (in the form of 1-a set of numbers as below, 2-IRanges, or 3-GenomicRanges), I'd like to split it into a set of smaller ranges.

Example Beginning:

Chr    Start     End
1        1        10000
2        1        5000

Example size of breaks: 2000

New dataset:

Chr    Start    End
1        1       2000
1        2001    4000
1        4001    6000
1        6001    8000
1        8001    10000
2        1       2000
2        2001    4000
2        4001    5000

I'm doing this in R. I know I could generate these simply with seq, but I'd like to be able to do it based on a list/df of regions instead of having to manually do it every time I have a new list of regions.

Here's an example I've made using seq:

Given 22 chromosomes, loop through them and break each into pieces

# initialize df
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each row, do the following
for(i in 1:nrow(Chromosomes)){
     # create a sequence from the minimum start to the max end by some value
     breks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]), max(Chromosomes$End[Chromosomes$Chromosome == i]), by=2000000)

     # put this into a dataframe
     database <- data.frame(Chromosome = i, Start = breks, End = c(breks[2:length(breks)]-1, max(Chromosomes$End[Chromosomes$Chromosome == i])))

     # bind with what we already have
     Regions <- rbind(Regions, database)
     rm(database)
}

This works fine, I'm wondering if there is something built into a package already to do this as a one-liner OR that is more flexible, as this has its limitations.

Gaius Augustus
  • 940
  • 2
  • 15
  • 37
  • So, to be clear, your goal is a function that takes in the data frame you show as "Example Beginning" along with the argument `breaks = 2000` and outputs the "New dataset"? If so, I agree. You could `seq` very easily - just do it in terms of variables and wrap it in `function(){}`, then you have your own custom function. – Gregor Thomas Sep 12 '16 at 21:18
  • I would go with `seq` kind of solution, may I ask why are we doing this? – zx8754 Sep 12 '16 at 21:19
  • Maybe sth like `library(dplyr); library(tidyr); breaks <- 2000L; df %>% group_by(Chr) %>% expand(Start = seq(Start, End, breaks), End = End) %>% mutate(End = if_else(Start+breaks>End, End, as.integer(Start+breaks-1)))`. But there are prly more elegant solutions to this problem. – lukeA Sep 12 '16 at 21:26
  • I'm working on a seq based solution that I'll post once it works properly, but was hoping there was something simpler. Thanks! @zx8754, I'm going to be looping through these regions and apply a custom function I wrote to them. It's not relevant to this question, but I'll be counting up frequencies of different events in each region so I can make a plot of frequencies in each region. – Gaius Augustus Sep 12 '16 at 21:45
  • You've post a problem where the interval width is constant at 2000 for all chromosomes. If you want something more complex, then you need to define the "dimensions" of that extra complexity which so far is missing. For instance if each chromosome has a potentially different width then add a new column to your example. Do edit the question, rather than responding in comments. – IRTFM Sep 13 '16 at 23:36
  • I didn't want something more complex (than a constant interval width). I just was hoping that a more elegant and robust function already existed to do it than the rudimentary function I created, which has its limitations as is. – Gaius Augustus Sep 15 '16 at 20:54

1 Answers1

3

Using the R / Bioconductor package GenomicRanges, here are your initial ranges

library(GenomicRanges)
rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))

and then create a sliding window across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted for the format you have in your question

> windows = slidingWindows(rngs, width=2000, step=2000)
> unlist(windows)
GRanges object with 8 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 [   1,  2000]      *
  [2]        1 [2001,  4000]      *
  [3]        1 [4001,  6000]      *
  [4]        1 [6001,  8000]      *
  [5]        1 [8001, 10000]      *
  [6]        2 [   1,  2000]      *
  [7]        2 [2001,  4000]      *
  [8]        2 [4001,  5000]      *

  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(tiles), "data.frame").

Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend, ?"slidingW<tab>).

Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile does something similar but rounds the width of ranges to be approximately equal while spanning the width of the GRanges. Here is a poor-man's version

windows <- function(gr, width, withMcols=FALSE) {
    starts <- Map(seq, start(rngs), end(rngs), by=width)
    ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
                starts, end(gr))
    seq <- rep(seqnames(gr), lengths(starts))
    strand <- rep(strand(gr), lengths(starts))
    result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
    seqinfo(result) <- seqinfo(gr)
    if (withMcols) {
        idx <- rep(seq_len(nrow(gr)), lengths(starts))
        mcols(result) = mcols(gr)[idx,,drop=FALSE]
    }
    result
}

invoked as

> windows(rngs, 2000)

If the approach is useful, consider asking follow-up questions on the Bioconductor support site.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • 2
    I knew there was some function in Granges, but couldn't easily Google for it, nor could find it in the manuals. Do you mind adding a link to the manual that documents this function? Is it part of GenomicRanges/IRanges, can't seem to find it in help `??slidingWindows`? – zx8754 Sep 13 '16 at 06:33
  • 2
    @zx8754 oops, sorry, it seems like this is only available in the devel version of GenomicRanges; I provided an ad hoc solution in the answer. – Martin Morgan Sep 13 '16 at 09:58