7

I have two files

encode

X.pattern.name       chr       start    stop    strand  score   p.value q.value matched.sequence
1   V_CETS1P54_01   chr1    98769545    98769554    +   11.42280    8.89e-05    NA  TCAGGATGTA
2   V_CETS1P54_01   chr1    152013037   152013046   +   11.98020    4.74e-05    NA  ACAGGAAGTT
3   V_CETS1P54_01   chr1    168932563   168932572   +   11.60860    7.59e-05    NA  ACCGGATGCT

encode.total

    chr     start           stop
1   chr1    58708485    58708713
2   chr1    58709084    58710538
3   chr1    98766295    98766639
4   chr1    98766902    98770338
5   chr1    107885506   107889414
6   chr1    138589531   138590856
7   chr1    138591180   138593378
8   chr1    152011746   152013185
9   chr1    152014263   152014695
10  chr1    168930561   168933076
11  chr1    181808064   181808906
12  chr1    184609002   184611519
13  chr1    193288453   193289567
14  chr1    193290105   193290490
15  chr1    193290744   193291092
16  chr1    196801920   196804954

I want to compare the two files, each entry by chr, start and stop. If the start and stop values in first file falls between the start and stop of the second file for the same chromosome, then that start & stop values in the first file should be replaced by the second file's start & stop values. I have written a for loop for that purpose but it is taking too long. What are the alternatives?

Code:

for(i in 1:nrow(encode))
{
     for(j in 1:nrow(encode.total))
     {
           if(encode[i,2]==encode.total[j,1])
           {
               if((encode[i,3]>=encode.total[j,2]) & (encode[i,4]<=encode.total[j,3]))
               {
                   encode[i,3]=encode.total[j,2]
                   encode[i,4]=encode.total[j,3]
               }
           }
     }
}

For the same purpose I also tried the GenomicRanges package like below. The size of my dataframes is huge and the merge function on them creates a VERY LARGE dataframe (>2 billion rows which is not allowed), although I eventually subset the dataframe in a smaller one. But merge() is taking a LOT of memory and terminating R.

gr1<-GRanges(seqnames=encode$chr,IRanges(start=encode$start,end=encode$end))
gr2<-GRanges(seqnames=encode.total$chr, IRanges(start=encode.total$start,end=encode.total$end))

ranges <- merge(as.data.frame(gr1),as.data.frame(gr2),by="seqnames",suffixes=c("A","B"))
ranges <- ranges[with(ranges, startB <= startA & endB >= endA),]
Komal Rathi
  • 4,164
  • 13
  • 60
  • 98
  • In your example, the start and stop values in both files are identical, so nothing would be changed... – juba Sep 30 '13 at 19:08
  • I have edited my files, please check now. – Komal Rathi Sep 30 '13 at 19:20
  • 1
    Duplicate? http://stackoverflow.com/questions/3916195/finding-overlap-in-ranges-with-r – Henrik Sep 30 '13 at 19:35
  • @Henrik The question you are pointing to works with Ranges object & I have a dataframe object. I am not familiar with Ranges object in R. – Komal Rathi Sep 30 '13 at 19:41
  • 1
    The OP in the question starts with "two data.frames each with three columns: chrom, start & stop". To me it seems quite similar to your data. – Henrik Sep 30 '13 at 19:44
  • @Henrik If you observe, that data only has unique values in the chr field whereas my data contains multiple entries for each chr. It is completely different than that data because the answer applies to data where chr field is unique. – Komal Rathi Sep 30 '13 at 20:16
  • OK! But hopefully you can get some inspiration from the answers. – Henrik Sep 30 '13 at 20:19
  • I tried implementing the solution, by modifying it accordingly. But it is not giving me the desired output. Like I said, I have the for-loop up and running but as my files are large, I am looking for a faster solution. Thanks. – Komal Rathi Sep 30 '13 at 20:22

1 Answers1

14

Use the Bioconductor GenomicRanges package.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")

Suppose there are two data frames x0 and x1, like encode and encode.total in the original example. We'd like to make these into a GRanges instance. I did this with

library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

It will often be possible to simply say makeGRangesFromDataFrame(x0), or use standard R commands to create a GRanges instance 'by hand'. Here we use with(), so that we can write GRanges(chr, IRanges(start=start, end=stop)) instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop)).

The next step is to find overlaps between query (gr0) and subject (gr1)

hits = findOverlaps(gr0, gr1)

which leads to

> hits
Hits of length 3
queryLength: 3
subjectLength: 16
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           4 
 2         2           8 
 3         3          10 

Then updated the relevant start / end coordinates

ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]

giving

> gr0
GRanges with 3 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [ 98766902,  98770338]      *
  [2]     chr1 [152011746, 152013185]      *
  [3]     chr1 [168930561, 168933076]      *
  ---
  seqlengths:
   chr1
     NA

This will be fast up into the millions of ranges.

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • This worked like a charm! Very useful and extremely efficient – Komal Rathi Oct 01 '13 at 01:54
  • What's the point of `with(x0...` and with `with(x1...`. What what are `x0` and `x1` anyway. – Omar Wagih Apr 13 '15 at 08:49
  • @by0 I added to the answer to try to explain x0 and x1 (two data frames containing genomic range information) and `with()` (a base R function that makes it easier to reference columns in the data.frame that is its first argument). – Martin Morgan Apr 13 '15 at 16:40
  • @MartinMorgan Can you please answer the following question? Thanks. https://stackoverflow.com/questions/48187739/r-processing-the-input-file-based-on-range-overlap – Newbie Jan 10 '18 at 17:29