3

Having segmented a genome into adjacent non-overlapping bins, e.g. by tileGenome, I have computed some property by some means for each bin (say either 1 or 2).

Now I want to merge adjacent with the same property. An minimal example is illustrated below:

library(GenomicRanges)
chrSizes <- c(chr1 = 1000, chr2 = 500)
bins   <- tileGenome(chrSizes, tilewidth = 200, cut.last.tile.in.chrom = T)
bins$property <- rep(1:2, each = 4)
bins
GRanges object with 8 ranges and 1 metadata column:
      seqnames    ranges strand |  property
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1     1-200      * |         1
  [2]     chr1   201-400      * |         1
  [3]     chr1   401-600      * |         1
  [4]     chr1   601-800      * |         1
  [5]     chr1  801-1000      * |         2
  [6]     chr2     1-200      * |         2
  [7]     chr2   201-400      * |         2
  [8]     chr2   401-500      * |         2
  -------
  seqinfo: 2 sequences from an unspecified genome

The first 4 bins have property 1, so should be combined into one bin.

I have looked through the GRanges documentation and can't find an obvious native solution. Note, seqname boundaries have to be taken into account (e.g. chr1 and chr2 remain separated irrespective of the property) Obviously, I one could use a loop, but I'd rather use a native GRange solutions e.g. using union which I might have overseen.

The desired output should look something like this:

      seqnames    ranges strand |  property
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1     1-800      * |         1
  [2]     chr1  801-1000      * |         2
  [3]     chr2     1-500      * |         2

1 Answers1

3

R GenomicRanges:

result <- unlist(reduce(split(bins, ~property)))
result$property <- names(result)

# GRanges object with 3 ranges and 1 metadata column:
# seqnames    ranges strand |    property
# <Rle> <IRanges>  <Rle> | <character>
# 1     chr1     1-800      * |           1
# 2     chr1  801-1000      * |           2
# 2     chr2     1-500      * |           2
# -------
# seqinfo: 2 sequences from an unspecified genome

Python PyRanges:

import pandas as pd
from io import StringIO
import pyranges as pr

c = """Chromosome Start End Value
chr1 1 200 Python
chr1 201 400 Python
chr1 401 600 Python
chr1 601 800 Python
chr1 801 1000 R
chr2 1 200 R
chr2 201 400 R
chr2 401 500 R"""

df = pd.read_table(StringIO(c), sep=" ")
gr = pr.PyRanges(df)
gr.merge(by="Value", slack=1)

# +--------------+-----------+-----------+------------+
# | Chromosome   |     Start |       End | Value      |
# | (category)   |   (int32) |   (int32) | (object)   |
# |--------------+-----------+-----------+------------|
# | chr1         |         1 |       800 | Python     |
# | chr1         |       801 |      1000 | R          |
# | chr2         |         1 |       500 | R          |
# +--------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 4 columns from 2 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
  • Thats exactly what I was looking for! Maybe, the code can be even a bit streamlined like this: `result <- unlist(reduce(split(bins, ~property))); result$score <- names(result)`. Feel free to update the answer if you agree. – Sebastian Müller Nov 15 '19 at 16:35
  • Yes, thanks. R is not my main language, so I learnt something new from your comment :) – The Unfun Cat Nov 18 '19 at 09:05