
I have a data.frame1 like:

1    bin chrom chromStart  chromEnd    name score
2     12  chr1   29123222  29454711 -5.7648   599
3    116  chr1   45799118  45986770 -4.8403   473
4    117  chr1   46327104  46490961 -5.3036   536
5    121  chr1   50780759  51008404 -4.4165   415
6    133  chr1   63634657  63864734 -4.8096   469
7    147  chr1   77825305  78062178 -5.4671   559

I also have a data.frame2 like:

  chrom chromStart chromEnd    N
1  chr1    63600000  63700000 1566
2  chr1    45800000  45900000 1566
3  chr1    29100000  29400000 1566
4  chr1    50400000  50500000 1566
5  chr1    46500000  46600000 1566

Basically, I have ranges of values from chromStart to chromEnd in data.frame1. I want to trim those ranges down to only the parts that overlap with ranges in data.frame2. For example, the first range of df1 is 29123222 to 29454711. I would like to cut that range down to 29123222 to 29400000, because that is the only part that overlaps with a range from df2. Does anyone know how I could do this?

The output I want is a data.frame like:

    1    bin chrom chromStart  chromEnd    name score
    2     12  chr1   29123222  29400000 -5.7648   599
    3    116  chr1   45800000  45900000 -4.8403   473
    6    133  chr1   63634657  63700000 -4.8096   469

Here is the data.frame that the current code gives me as output:

  chrom chromStart chromEnd bin    name score
1  chr1   29123222 29130000  12 -5.7648   599
2  chr1   29123222 29140000  12 -5.7648   599
3  chr1   29123222 29150000  12 -5.7648   599
4  chr1   29123222 29160000  12 -5.7648   599
5  chr1   29123222 29170000  12 -5.7648   599
Evan
  • What would be the behavior if a data.frame1 line is 1 to 9 and the ranges 1 to 3 and 6 to 7 are in data.frame2? – user1470500 Oct 04 '16 at 02:29
  • Then I would want ranges 1 to 3 and ranges 6 to 7 kept. Maybe it would be best to trim down the data.frame2? – Evan Oct 04 '16 at 02:34
  • You could merge any intervals that overlap in data.frame2 into one interval. Then sort the resulting list of intervals by starting point; at that point, checking the intersection should be easier. – user1470500 Oct 04 '16 at 02:50
  • How would that be done? – Evan Oct 04 '16 at 02:53
  • You might try `data.table::foverlaps` or `IRanges::findOverlaps`. – alistaire Oct 04 '16 at 03:55
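
The merge-then-intersect idea from the comments can be sketched in base R. This is only a minimal illustration under stated assumptions: `merge_intervals` and `clip_to` are hypothetical helper names (not from any package), intervals are treated as closed, and the data is the 1 to 9 example from the comments with one extra nested interval added.

```r
# Hypothetical helper: merge overlapping (or touching) closed intervals.
merge_intervals <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # A new group begins whenever an interval starts after all earlier ones have ended.
  prev_max_end <- c(-Inf, head(cummax(end), -1))
  group <- cumsum(start > prev_max_end)
  data.frame(
    start = as.numeric(tapply(start, group, min)),
    end   = as.numeric(tapply(end, group, max))
  )
}

# Hypothetical helper: clip one query interval to every merged interval it overlaps.
clip_to <- function(q_start, q_end, merged) {
  keep <- merged$start <= q_end & merged$end >= q_start
  data.frame(
    start = pmax(q_start, merged$start[keep]),
    end   = pmin(q_end, merged$end[keep])
  )
}

# Intervals 1-3 and 2-3 overlap and collapse into 1-3; 6-7 stays separate.
merged <- merge_intervals(start = c(6, 1, 2), end = c(7, 3, 3))

# The query 1-9 is cut down to the two pieces 1-3 and 6-7.
clipped <- clip_to(1, 9, merged)
print(clipped)
```

Because `merge_intervals` sorts by start before grouping, the merged intervals come back ordered, which is what makes the later intersection check straightforward.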

1 Answer


+1 for suggesting IRanges::findOverlaps.

Here's a solution using findOverlaps and GenomicRanges:

library(GenomicRanges);

df1 <- cbind.data.frame(
    bin = c(12, 116, 117, 121, 133, 147),
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
    chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
    name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
    score = c(599, 473, 536, 415, 469, 559));

df2 <- cbind.data.frame(
    chrom = c("chr1", "chr1", "chr1", "chr1", "chr1"),
    chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
    chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000),
    N = c(1566, 1566, 1566, 1566, 1566));

# Make GRanges objects from dataframes
gr1 <- with(df1, GRanges(
    chrom, 
    IRanges(start = chromStart, end = chromEnd), 
    bin = bin, 
    name = name, 
    score = score));

gr2 <- with(df2, GRanges(
    chrom,
    IRanges(start = chromStart, end = chromEnd),
    N = N));

# Get overlapping features
hits <- findOverlaps(query = gr1, subject = gr2);

# Get features from gr1 that overlap with features from gr2
idx1 <- queryHits(hits);
idx2 <- subjectHits(hits);
gr <- gr1[idx1];

# Make sure that we only keep the intersecting ranges
# (element-wise max of the starts, element-wise min of the ends)
start(gr) <- pmax(start(gr), start(gr2[idx2]));
end(gr) <- pmin(end(gr), end(gr2[idx2]));

print(gr);

GRanges object with 3 ranges and 3 metadata columns:
      seqnames               ranges strand |       bin      name     score
         <Rle>            <IRanges>  <Rle> | <numeric> <numeric> <numeric>
  [1]     chr1 [29123222, 29400000]      * |        12   -5.7648       599
  [2]     chr1 [45800000, 45900000]      * |       116   -4.8403       473
  [3]     chr1 [63634657, 63700000]      * |       133   -4.8096       469
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Turn GRanges into a dataframe
df <- data.frame(bin = mcols(gr)$bin, 
                 chrom = seqnames(gr), 
                 chromStart = start(gr), 
                 chromEnd = end(gr), 
                 name = mcols(gr)$name, 
                 score = mcols(gr)$score);
print(df);  
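
As a possible alternative to adjusting the starts and ends by hand, the clipping step can also be written with `pintersect`, which intersects two equal-length `GRanges` objects element by element. The sketch below rebuilds the example data so that it runs on its own; `makeGRangesFromDataFrame` is used here only as a convenience, and the names `q`, `s`, `clipped` and `clipped_df` are illustrative choices rather than part of the code above.

```r
library(GenomicRanges)

df1 <- data.frame(
  bin = c(12, 116, 117, 121, 133, 147),
  chrom = "chr1",
  chromStart = c(29123222, 45799118, 46327104, 50780759, 63634657, 77825305),
  chromEnd = c(29454711, 45986770, 46490961, 51008404, 63864734, 78062178),
  name = c(-5.7648, -4.8403, -5.3036, -4.4165, -4.8096, -5.4671),
  score = c(599, 473, 536, 415, 469, 559)
)

df2 <- data.frame(
  chrom = "chr1",
  chromStart = c(63600000, 45800000, 29100000, 50400000, 46500000),
  chromEnd = c(63700000, 45900000, 29400000, 50500000, 46600000),
  N = 1566
)

# Build GRanges objects, naming the coordinate columns explicitly
gr1 <- makeGRangesFromDataFrame(df1, keep.extra.columns = TRUE,
                                seqnames.field = "chrom",
                                start.field = "chromStart",
                                end.field = "chromEnd")
gr2 <- makeGRangesFromDataFrame(df2, keep.extra.columns = TRUE,
                                seqnames.field = "chrom",
                                start.field = "chromStart",
                                end.field = "chromEnd")

# One row per overlapping (df1 range, df2 range) pair
hits <- findOverlaps(query = gr1, subject = gr2)
q <- gr1[queryHits(hits)]
s <- gr2[subjectHits(hits)]

# Element-wise intersection of each overlapping pair
clipped <- pintersect(q, s)

# Metadata is taken from q, the matching df1 ranges
clipped_df <- data.frame(
  bin = q$bin,
  chrom = as.character(seqnames(clipped)),
  chromStart = start(clipped),
  chromEnd = end(clipped),
  name = q$name,
  score = q$score
)
print(clipped_df)
```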
Maurits Evers
  • This is close to what I want but it's giving me many ranging from 29123222 to 29300000, 29400000, 29500000. Also, is it possible to get this output in a data.frame? – Evan Oct 05 '16 at 18:31
  • Not sure what you mean by "giving me many ranging from ...". This method attempts to find an overlapping region in df2 *for every feature* in df1. Isn't this what you wanted? It reproduces the example you give above. You can get a dataframe simply by `data.frame(chrom = seqnames(gr), chromStart = start(gr), chromEnd = end(gr), bin = mcols(gr)$bin, name = mcols(gr)$name, score = mcols(gr)$score)`. – Maurits Evers Oct 05 '16 at 20:18
  • I put the output from your code in my most recent edit. I would just like the largest range that exists for each range that is being trimmed in df1 – Evan Oct 06 '16 at 00:17
  • I'm afraid I can't reproduce your edited output. If I run the code from my answer I get exactly the output you seem to be after, albeit as a `GRanges` object rather than as a `data.frame`. I have edited my answer to produce a data.frame with exactly the same column ordering as in your original post. Please take a look and let me know if you still get something different. – Maurits Evers Oct 06 '16 at 06:28
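
One possible explanation for the repeated rows, offered only as a guess since the full df2 is not shown in the thread: if df2 contains several overlapping or nested windows, `findOverlaps` returns one hit per window, so each df1 range appears once per window. Collapsing df2 with `reduce` before looking for overlaps yields one row per merged window. The toy data below is assumed for illustration and is not the asker's real df2.

```r
library(GenomicRanges)

# Toy data (assumed): one df1-style range, three nested windows, one separate window
gr1 <- GRanges("chr1",
               IRanges(start = 29123222, end = 29454711),
               bin = 12)
gr2 <- GRanges("chr1",
               IRanges(start = c(29100000, 29100000, 29100000, 29420000),
                       end   = c(29130000, 29140000, 29150000, 29500000)))

# Merge overlapping windows into their union; separate windows stay separate
gr2_merged <- reduce(gr2)

hits <- findOverlaps(query = gr1, subject = gr2_merged)
q <- gr1[queryHits(hits)]
s <- gr2_merged[subjectHits(hits)]

# Clip each df1 range to the merged window it overlaps
out <- data.frame(
  bin = q$bin,
  chrom = as.character(seqnames(q)),
  chromStart = pmax(start(q), start(s)),
  chromEnd = pmin(end(q), end(s))
)
print(out)
```

Without the `reduce` step the same data would give four rows, three of them differing only in `chromEnd`. Note that `reduce` drops metadata columns such as `N` and, by default, also merges windows that are directly adjacent.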