I have two data frames and want to use a value in one (DF1$pos) to search two columns in the other (DF2$start, DF2$end). If the value falls within that range, I want to return DF2$annot as DF1$name.

DF1

ID   pos  name
chr   12
chr  542
chr  674

DF2

ID   start   end   annot
chr      1   200      a1
chr    201   432      a2
chr    540  1002      a3
chr   2000  2004      a4

So in this example I would like DF1 to become:

ID   pos  name
chr   12    a1
chr  542    a3
chr  674    a3

I have tried using merge and intersect, but I do not know how to apply a logical condition (such as an if statement) within them.

The data frames can be created as follows:

DF1  <- data.frame(ID=c("chr","chr","chr"),
               pos=c(12,542,674),
               name=c(NA,NA,NA))

DF2  <- data.frame(ID=c("chr","chr","chr","chr"),
               start=c(1,201,540,2000),
               end=c(200,432,1002,2004),
               annot=c("a1","a2","a3","a4"))
DarrenRhodes
SemiQuant
  • I didn't vote this question down but I think whoever did so was because you didn't put your data frames in R format – DarrenRhodes Dec 23 '14 at 11:43

2 Answers

Perhaps you can use foverlaps from the "data.table" package.

library(data.table)
DT1 <- data.table(DF1)
DT2 <- data.table(DF2)
setkey(DT2, ID, start, end)
DT1[, c("start", "end") := pos]  ## I don't know if there's a way around this step...
foverlaps(DT1, DT2)
#     ID start  end annot pos i.start i.end
# 1: chr     1  200    a1  12      12    12
# 2: chr   540 1002    a3 542     542   542
# 3: chr   540 1002    a3 674     674   674
foverlaps(DT1, DT2)[, c("ID", "pos", "annot"), with = FALSE]
#     ID pos annot
# 1: chr  12    a1
# 2: chr 542    a3
# 3: chr 674    a3

As mentioned by @Arun in the comments, you can also use which = TRUE in foverlaps to extract the relevant values:

foverlaps(DT1, DT2, which = TRUE)
#    xid yid
# 1:   1   1
# 2:   2   3
# 3:   3   3
DT2$annot[foverlaps(DT1, DT2, which = TRUE)$yid]
# [1] "a1" "a3" "a3"
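
The extra start/end columns can probably be avoided with a non-equi join, which newer versions of "data.table" support. The following is only a sketch under that assumption, not part of the original answer. It rebuilds the question's data directly as data.tables and leaves out the all-NA name column so it can be created by the join:

```r
library(data.table)

# Example data from the question, built directly as data.tables
DT1 <- data.table(ID = "chr", pos = c(12, 542, 674))
DT2 <- data.table(ID = "chr",
                  start = c(1, 201, 540, 2000),
                  end   = c(200, 432, 1002, 2004),
                  annot = c("a1", "a2", "a3", "a4"))

# Update join: for each range in DT2, find the rows of DT1 with the same ID
# whose pos lies in [start, end], and copy annot into a new name column.
# Rows of DT1 that match no range keep NA in name.
DT1[DT2, on = .(ID, pos >= start, pos <= end), name := i.annot]
DT1
```

Because this assigns by reference, DT1 itself gains the name column and no intermediate joined table is returned.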
A5C1D2H2I1M1N2O1R2T1
  • @Arun would there be a `dplyr` way to achieve this ? – Steven Beaupré Dec 23 '14 at 14:07
  • @StevenBeaupré, no overlap joins, but using `between()` - check [this post](http://stackoverflow.com/a/24480301/559784). It materialises the entire join and then filters. – Arun Dec 23 '14 at 14:45
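
The "materialise the entire join, then filter" approach from the comment above can be sketched in base R with merge followed by a logical subset. This is an illustration of the idea, not the dplyr code from the linked post; the names pairs and hits are made up for the example:

```r
# Example data from the question (name column omitted)
DF1 <- data.frame(ID = "chr", pos = c(12, 542, 674),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = "chr",
                  start = c(1, 201, 540, 2000),
                  end   = c(200, 432, 1002, 2004),
                  annot = c("a1", "a2", "a3", "a4"),
                  stringsAsFactors = FALSE)

# Step 1: materialise every position/range pair that shares an ID
pairs <- merge(DF1, DF2, by = "ID")

# Step 2: keep only the pairs where pos falls inside [start, end]
hits <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end,
              c("ID", "pos", "annot")]
hits <- hits[order(hits$pos), ]
rownames(hits) <- NULL
hits
```

The intermediate table has one row per position/range combination within each ID, so this is likely to be practical only for small inputs; an overlap join avoids building it.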

You could also use IRanges

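# biocLite() has been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("IRanges")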
library(IRanges)
DF1N <- with(DF1, IRanges(pos, pos))
DF2N <- with(DF2, IRanges(start, end))
DF1$name <- DF2$annot[subjectHits(findOverlaps(DF1N, DF2N))]
DF1
#   ID pos name
#1 chr  12   a1
#2 chr 542   a3
#3 chr 674   a3
akrun
  • Nice. Using `GenomicRanges` package might be more relevant here though, as `IRanges` only deals with ranges (without identifiers). Or you've to use RangesList I think, but GRanges is much more convenient. The Q indicates only one chromosome, but I doubt it's the case in the real data set. – Arun Dec 23 '14 at 14:46
  • @Arun Thanks for the info. I don't have the `GenomicRanges` installed. Will check it. – akrun Dec 23 '14 at 14:47
  • @Arun Could you add that as a separate answer? – akrun Dec 23 '14 at 14:50
  • thanks @Arun, i think the GenomicRanges is exactly what i was looking for. – SemiQuant Dec 23 '14 at 15:57
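
The `GenomicRanges` suggestion from the comments could look roughly like the sketch below. It was not posted in the thread and assumes the Bioconductor package `GenomicRanges` is installed. The ID column is used as the sequence name, so positions are only matched against ranges on the same chromosome:

```r
library(GenomicRanges)

# Example data from the question (name column omitted)
DF1 <- data.frame(ID = "chr", pos = c(12, 542, 674),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = "chr",
                  start = c(1, 201, 540, 2000),
                  end   = c(200, 432, 1002, 2004),
                  annot = c("a1", "a2", "a3", "a4"),
                  stringsAsFactors = FALSE)

# Each position becomes a width-1 range on its chromosome
gr1 <- GRanges(seqnames = DF1$ID,
               ranges = IRanges(start = DF1$pos, width = 1))
gr2 <- GRanges(seqnames = DF2$ID,
               ranges = IRanges(start = DF2$start, end = DF2$end))

# select = "first" returns one index per query, NA where nothing overlaps
idx <- findOverlaps(gr1, gr2, select = "first")
DF1$name <- DF2$annot[idx]
DF1
```

Using `select = "first"` keeps the result aligned with the rows of DF1 even when a position falls in no range; if ranges can overlap each other, only the first matching annotation is kept.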