11

I am trying to find a way to collapse rows with intersecting ranges, denoted by "start" and "stop" columns, and record the collapsed values into new columns. For example I have this data frame:

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))


chrom name  start   stop
 1    a        0    71200
 1    b    70001    71200
 1    c    70203    80001
 1    d    70060    71051
14    e    40004    42004
16    f 50000872 50000890
16    g 50000872 51000952

And I am trying to find the overlapping ranges and record the biggest range covered by the collapsed overlapping rows in "start" and "stop" and the names of the collapsed rows, so I would get this:

chrom start   stop      name
 1    70001    80001    a,b,c,d
14    40004    42004    e
16    50000872 51000952 f,g

I think I could use the packages IRanges like this:

library(IRanges)
ranges <- split(IRanges(my.df$start, my.df$stop), my.df$chrom)

But then I have trouble getting the collapsed columns: I have tried with findOvarlaps but this

ov <- findOverlaps(ranges, ranges, type="any")

but I don't think this is right.

Any help would be extremely appreciated.

zx8754
  • 52,746
  • 12
  • 114
  • 209
user971102
  • 3,005
  • 4
  • 30
  • 37
  • I edited the text to reflect the data better by adding the first position at start 0. With either of the approaches suggested chrom 14 is not correctly grouped, can you please help me understand why? Thank you! – user971102 Jun 06 '13 at 09:50
  • Please consider [accepting below answers](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – zx8754 Apr 16 '18 at 07:25

3 Answers3

12

IRanges is a good candidate for such job. No need to use chrom variable.

ir <- IRanges(my.df$start, my.df$stop)
## I create a new grouping variable Note the use of reduce here(performance issue)
my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
# chrom name    start     stop group2
# 1     1    a    70001    71200      2
# 2     1    b    70203    80001      2
# 3     1    c    70060    71051      2
# 4    14    d    40004    42004      1
# 5    16    e 50000872 50000890      3
# 6    16    f 50000872 51000952      3

The new group2 variable is the range indicator. Now using data.table I can't transform my data to the desired output:

library(data.table)
DT <- as.data.table(my.df)
DT[, list(start=min(start),stop=max(stop),
         name=list(name),chrom=unique(chrom)),
               by=group2]

# group2    start     stop  name chrom
# 1:      2    70001    80001 a,b,c     1
# 2:      1    40004    42004     d    14
# 3:      3 50000872 51000952   e,f    16

PS: the collapsed variable name here is not string but a list of factor. This is more efficient and easier to access than a collapased character using paste for example.

EDIT after OP clarification, I will create the group variable by chrom. I mean the Iranges code now is called for each chrom group. I slightly modify your data, to create group of intervals the same chromosome.

my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), 
                   name=c("a","b","c","d","e","f","g"),
                   start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)), 
                   stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))

library(data.table)
DT <- as.data.table(my.df)

## find interval for each chromsom
DT[,group := { 
      ir <-  IRanges(start, stop);
       subjectHits(findOverlaps(ir, reduce(ir)))
      },by=chrom]

## Now I group by group and chrom 
DT[, list(start=min(start),stop=max(stop),name=list(name),chrom=unique(chrom)),
   by=list(group,chrom)]

  group chrom    start     stop name chrom
1:     1     1        0        1    a     1
2:     2     1     3000     5000    b     1
3:     3     1    70060    80001  c,d     1
4:     1    14    40004    42004    e    14
5:     1    16 50000872 51000952  f,g    16
zx8754
  • 52,746
  • 12
  • 114
  • 209
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • @storaged yes really good. To install it , you should do the following `source("http://bioconductor.org/biocLite.R") biocLite("IRanges")` – agstudy Jun 06 '13 at 09:30
  • I edited the main text to reflect better my dataframe, I also have start positions at 0 and if I apply this I don't get the correct overlaps… What am I doing wrong? – user971102 Jun 06 '13 at 09:40
  • @user971102 I edit my answer. I think the problem with creating 0 is that you create a big interval that contains others... – agstudy Jun 06 '13 at 09:54
  • @agstudy, thank you, this works great too!! Both of these approaches now give the correct answers, thanks a lot! – user971102 Jun 06 '13 at 10:30
  • 3
    Using `GRanges` from the GenomicRanges package makes sense here -- `gr <- GRanges(my.df$chrom, IRanges(my.df$start, my.df$stop))` -- then use `gr` instead of `ir` as in the answer. Might as well assign the grouping variable to the GRanges `gr$group <- ...` or even better split the GRanges into a GRangesList `split(gr, subjectHits(findOverlaps(gr, reduce(gr))))` which might seem kind of 'heavy' but actually is relatively memory efficient. – Martin Morgan Jun 06 '13 at 11:46
5

After sorting the data, you can easily test if an interval overlaps the previous one, and assign a label to each set of overlapping intervals. Once you have those labels, you can use ddply to aggregate the data.

d <- data.frame(
  chrom = c(1,1,1,14,16,16), 
  name  = c("a","b","c","d","e","f"), 
  start = as.numeric(c(70001,70203,70060, 40004, 50000872, 50000872)), 
  stop  = as.numeric(c(71200,80001,71051, 42004, 50000890, 51000952))
)

# Make sure the data is sorted
d <- d[ order(d$start), ]

# Check if a record should be linked with the previous
d$previous_stop <- c(NA, d$stop[-nrow(d)])
d$previous_stop <- cummax(ifelse(is.na(d$previous_stop),0,d$previous_stop))
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop

# The number of the current group of records is the number of times we have switched to a new group
d$group <- cumsum( d$new_group )

# We can now aggregate the data
library(plyr)
ddply( 
  d, "group", summarize, 
  start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
#   group    start     stop    name
# 1     1        0    80001 a,d,c,b
# 2     2 50000872 51000952     e,f

But this ignores the chrom column: to account for it, you can do the same thing for each chromosome, separately.

d <- d[ order(d$chrom, d$start), ]
d <- ddply( d, "chrom", function(u) { 
  x <- c(NA, u$stop[-nrow(u)])
  y <- ifelse( is.na(x), 0, x )
  y <- cummax(y)
  y[ is.na(x) ] <- NA
  u$previous_stop <- y
  u
} )
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop
d$group <- cumsum( d$new_group )
ddply( 
  d, .(chrom,group), summarize, 
  start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
#   chrom group    start     stop  name
# 1     1     1        0    80001 a,c,b
# 2    14     2    40004    42004     d
# 3    16     3 50000872 51000952   e,f
Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
  • Thank you, I also have d$start 0, and if I take this it seems to mess up everything and group it in an odd way using this code… (I just edited the main text to reflect this odd behaviour..) – user971102 Jun 06 '13 at 09:34
  • My code only checked if a record should be linked with the previous one, instead of the previous ones. This should be fixed. – Vincent Zoonekynd Jun 06 '13 at 09:58
0

The ivs package can be used for this. It is a specialized package for interval vectors. Use iv_identify_group() to determine the widest ranges per group, then group on that and summarise your name column.

library(dplyr)
library(ivs)

my.df <- data.frame(
  chrom=c(1,1,1,1,14,16,16), 
  name=c("a","b","c","d","e","f","g"), 
  start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), 
  stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)),
  stringsAsFactors = FALSE
)

my.df <- my.df %>%
  mutate(range = iv(start, stop), .keep = "unused")

my.df %>%
  group_by(chrom) %>%
  mutate(range = iv_identify_group(range)) %>%
  group_by(chrom, range) %>%
  summarise(name = paste0(name, collapse = ","), .groups = "drop")
#> # A tibble: 3 × 3
#>   chrom                range name   
#>   <dbl>            <iv<dbl>> <chr>  
#> 1     1           [0, 80001) a,b,c,d
#> 2    14       [40004, 42004) e      
#> 3    16 [50000872, 51000952) f,g
Davis Vaughan
  • 2,780
  • 9
  • 19