overlapping genomic intervals and merging datasets

Question

I want to do something similar to the solution in this thread: I have two data frames, and I want to find the regions that overlap and then merge the corresponding data for each hit.

>x1
  chr start stop CN
1   1    10  140  G
2   1   100 1000  G
3   1  1500 5000  L



>x2
  chr start stop gene
1   1     1  100    a
2   1   100  150    b
3   1   190 1000    c
4   1  1000 2000    d
5   1  2000 5000    e

I can find the regions that overlap with the following code:

library(GenomicRanges)
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
gr2 = with(x2, GRanges(chr, IRanges(start=start, end=stop)))

hits = findOverlaps(gr1, gr2)

The hits show which regions in x1 overlap which regions in x2, e.g.:

> hits
Hits of length 8
queryLength: 3
subjectLength: 5
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           1 
 2         1           2 
 3         2           1 
 4         2           2 
 5         2           3 
 6         2           4 
 7         3           4 
 8         3           5
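
For reference, the overlap test behind these hits can be sketched in base R. This is only an illustration, assuming closed intervals (both endpoints inclusive, as IRanges treats them) and a match on chr; the name manual_hits is made up for this sketch and is not part of GenomicRanges.

```r
# Example data copied from the question
x1 <- data.frame(chr = 1, start = c(10, 100, 1500),
                 stop = c(140, 1000, 5000), CN = c("G", "G", "L"))
x2 <- data.frame(chr = 1, start = c(1, 100, 190, 1000, 2000),
                 stop = c(100, 150, 1000, 2000, 5000),
                 gene = c("a", "b", "c", "d", "e"))

# Every (query, subject) index pair; expand.grid varies its first
# argument fastest, so pairs come out ordered by query, then subject
pairs <- expand.grid(subjectHits = seq_len(nrow(x2)),
                     queryHits = seq_len(nrow(x1)))
q <- pairs$queryHits
s <- pairs$subjectHits

# Two closed intervals overlap when each one starts no later than the other ends
overlaps <- x1$chr[q] == x2$chr[s] &
  x1$start[q] <= x2$stop[s] &
  x2$start[s] <= x1$stop[q]

manual_hits <- pairs[overlaps, c("queryHits", "subjectHits")]
rownames(manual_hits) <- NULL
manual_hits
```

This brute-force check compares every pair, so it only suits small inputs; findOverlaps is the tool to use on real data.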

What I would like instead is output that includes both the CN and gene info from x1 and x2. The output would look like this:

  x1chr x1start x1stop x1CN x2chr x2start x2stop x2gene
1     1      10    140    G     1       1    100      a
2     1      10    140    G     1     100    150      b
3     1     100   1000    G     1       1    100      a
4     1     100   1000    G     1     100    150      b
5     1     100   1000    G     1     190   1000      c
6     1     100   1000    G     1    1000   2000      d
7     1    1500   5000    L     1    1000   2000      d
8     1    1500   5000    L     1    2000   5000      e

score 3 · Answer 1 · answered May 13 '15 at 02:51

You can use foverlaps from the data.table package:

library(data.table)
setkey(setDT(x1), start, stop)
setkey(setDT(x2), start, stop)
foverlaps(x2, x1)
#   chr start stop CN i.chr i.start i.stop gene
#1:   1    10  140  G     1       1    100    a
#2:   1   100 1000  G     1       1    100    a
#3:   1    10  140  G     1     100    150    b
#4:   1   100 1000  G     1     100    150    b
#5:   1   100 1000  G     1     190   1000    c
#6:   1   100 1000  G     1    1000   2000    d
#7:   1  1500 5000  L     1    1000   2000    d
#8:   1  1500 5000  L     1    2000   5000    e
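
The call above keys only on start and stop, which works here because every row is on chromosome 1. With several chromosomes, a sketch along the following lines, assuming chr is added as the first key column, should restrict matches to intervals on the same chromosome. The data are copied from the question, stored as integers.

```r
library(data.table)

# Example data from the question, with integer coordinates
x1 <- data.table(chr = 1L, start = c(10L, 100L, 1500L),
                 stop = c(140L, 1000L, 5000L), CN = c("G", "G", "L"))
x2 <- data.table(chr = 1L, start = c(1L, 100L, 190L, 1000L, 2000L),
                 stop = c(100L, 150L, 1000L, 2000L, 5000L),
                 gene = c("a", "b", "c", "d", "e"))

# The last two key columns are the interval start and end;
# any key columns before them (here chr) must match exactly
setkey(x1, chr, start, stop)
setkey(x2, chr, start, stop)

# For each interval in x2, return the overlapping intervals in x1
res <- foverlaps(x2, x1, type = "any")
res
```

On this data, res should hold the same eight CN/gene pairs as the output above.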

score 2 · Answer 2 · answered May 13 '15 at 02:52


I have managed to find a very simple solution. Indexing each data frame by its side of the hits and binding the results together:

x <- cbind(x1[queryHits(hits), ], x2[subjectHits(hits), ])

This produces the desired output.
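
A plain cbind leaves both halves with the same column names (chr, start, stop). A minimal sketch of one way to get the x1/x2 prefixes from the desired output follows; it assumes the data and GRanges construction from the question, and the names a, b and merged are made up for this example.

```r
library(GenomicRanges)

# Example data copied from the question
x1 <- data.frame(chr = 1, start = c(10, 100, 1500),
                 stop = c(140, 1000, 5000), CN = c("G", "G", "L"))
x2 <- data.frame(chr = 1, start = c(1, 100, 190, 1000, 2000),
                 stop = c(100, 150, 1000, 2000, 5000),
                 gene = c("a", "b", "c", "d", "e"))

# Build the ranges as in the question (chr passed as character seqnames)
gr1 <- with(x1, GRanges(as.character(chr), IRanges(start = start, end = stop)))
gr2 <- with(x2, GRanges(as.character(chr), IRanges(start = start, end = stop)))
hits <- findOverlaps(gr1, gr2)

# Pull the matching rows from each data frame, one row per hit
a <- x1[queryHits(hits), ]
b <- x2[subjectHits(hits), ]

# Prefix the column names so the two halves stay distinguishable
names(a) <- paste0("x1", names(a))
names(b) <- paste0("x2", names(b))

merged <- cbind(a, b)
rownames(merged) <- NULL
merged
```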


George


score 0 · Answer 3 · answered Aug 30 '15 at 10:50


If you are on Linux or macOS, you can install bedtools (http://bedtools.readthedocs.org/en/latest/index.html) and then run the command "intersectBed -a fileA.txt -b fileB.txt -wa -wb > youroutputfile.txt". For each overlap, the output file contains the full record from file A followed by the matching record from file B. bedtools is fast and widely used for high-throughput datasets.
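
To get from the data frames in the question to the two input files, something like the following sketch could be used. It assumes the question's coordinates are 1-based with inclusive ends, so only the start is shifted to match BED's 0-based starts; write_bed is a hypothetical helper, and the files are written to R's temporary directory purely for illustration.

```r
# Data from the question, stored as integers so write.table() never
# switches to scientific notation (e.g. 1e+05) in the output files
x1 <- data.frame(chr = 1L, start = c(10L, 100L, 1500L),
                 stop = c(140L, 1000L, 5000L), CN = c("G", "G", "L"))
x2 <- data.frame(chr = 1L, start = c(1L, 100L, 190L, 1000L, 2000L),
                 stop = c(100L, 150L, 1000L, 2000L, 5000L),
                 gene = c("a", "b", "c", "d", "e"))

# Hypothetical helper: write a data frame as a headerless, tab-delimited file.
# Assumes 1-based inclusive starts, which are shifted to 0-based BED starts.
write_bed <- function(df, path) {
  df$start <- df$start - 1L
  write.table(df, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(path)
}

file_a <- write_bed(x1, file.path(tempdir(), "fileA.txt"))
file_b <- write_bed(x2, file.path(tempdir(), "fileB.txt"))
```

If the coordinates are already 0-based, the start shift should be dropped.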


adamtongji


Unfortunately, my work insists we use Windows. However, I have used bedtools before, and Galaxy.org also has tools that allow this. – George Sep 02 '15 at 02:55
