I have a reference table in which each row contains an interval (columns 1 and 2) and two other values (a color, "red" or "blue", and a direction, "+" or "-"), such as interv below:

interv1 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("+",10))
interv2 <- cbind(seq(from = 3, to = 40, by = 4),seq(from = 5, to = 50, by = 5), c(rep("blue",5), rep("red", 5)), rep("-",10))
interv  <- rbind(interv1, interv2)

     [,1] [,2] [,3]   [,4]
[1,] "3"  "5"  "blue" "+" 
[2,] "7"  "10" "blue" "+" 
[3,] "11" "15" "blue" "+" 
[4,] "15" "20" "blue" "+" 
[5,] "19" "25" "blue" "+" 
[6,] "23" "30" "red"  "+" 

I also have a table of interest, to_match, in which each row holds a specific position that falls within intervals of the first table, plus the color and direction variables.

to_match <- cbind(rep(seq(from = 4, to = 43, by = 4),2), rep(c(rep("blue", 5), rep("red", 5)), 2), c(rep("-", 10), rep("+", 10)))

     [,1] [,2]   [,3]
[1,] "4"  "blue" "-" 
[2,] "8"  "blue" "-" 
[3,] "12" "blue" "-" 
[4,] "16" "blue" "-" 
[5,] "20" "blue" "-" 
[6,] "24" "red"  "-" 

What I would like to do is assign each to_match value to the right interval when both the color and the direction are the same. The idea is to get something like this:

     [,1] [,2] [,3]   [,4] [,5]
[1,] "3"  "5"  "blue" "+"  "4"

or the opposite :

     [,1] [,2]   [,3] [,4] [,5]
[1,] "4"  "blue" "-"  "3"  "5"

I started by trying the data.table::between() function, but it became a mess quite quickly... In my real dataset, to_match does not have the same number of rows as interv (not sure if this is relevant).

Paul Endymion
  • This is an *overlapping genomic intervals* problem. Since you tagged with `data.table`, I posted my recent answer where I use `data.table::foverlaps`. You just need to set the key by chromosome and strand (color and direction): `setkey(interv, chr, strand, start, end); setkey(to_match, chr, strand, start, end); foverlaps(interv, to_match)`. Also, you need to create an end column in `to_match`. – pogibas Feb 26 '19 at 10:59
  • So I simply duplicated the start as an "end" column (as I understand you did in your example) and it did the trick! Thank you! "Brilliant, this is immensely fast." Congrats on figuring out that I'm working with genomes. – Paul Endymion Feb 26 '19 at 12:48
  • Yes. First turn data into a data.table: `setDT(to_match)`, then add end `to_match[, end := start]`, then set key `setkey(to_match, chr, strand, start, end)`. – pogibas Feb 26 '19 at 12:54
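The steps described in these comments can be sketched roughly as follows. This is only an illustration under assumptions: the column names `start`, `end`, `chr` and `strand` are taken from the comments (they are not in the question's matrices), the sample data is rebuilt directly as data.tables, and `to_match` is used as the query table with `type = "within"` so that each position is looked up inside the intervals.

```r
library(data.table)

# Rebuild the question's sample data with named, numeric columns.
# The names start/end/chr/strand are assumptions borrowed from the comments.
interv <- data.table(
  start  = rep(seq(3, 40, by = 4), 2),
  end    = rep(seq(5, 50, by = 5), 2),
  chr    = rep(rep(c("blue", "red"), each = 5), 2),
  strand = rep(c("+", "-"), each = 10)
)
to_match <- data.table(
  start  = rep(seq(4, 43, by = 4), 2),
  chr    = rep(rep(c("blue", "red"), each = 5), 2),
  strand = rep(c("-", "+"), each = 10)
)

# A single position is treated as a zero-width interval.
to_match[, end := start]

# foverlaps() needs the lookup table keyed, with the interval columns last.
setkey(interv, chr, strand, start, end)

# For every position, find the intervals with the same chr/strand that contain it.
res <- foverlaps(to_match, interv,
                 by.x = c("chr", "strand", "start", "end"),
                 type = "within")
print(res)
```

In the result, `start` and `end` come from `interv`, while `i.start` and `i.end` hold the position from `to_match`.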

1 Answer

A non-equi join will help you out here.

create sample data

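library(data.table)

# convert the character matrices to data.tables (columns are named V1, V2, ...)
dt1 <- as.data.table( interv, stringsAsFactors = FALSE )
dt2 <- as.data.table( to_match, stringsAsFactors = FALSE )
# cbind() produced character matrices, so make the positions numeric for the join
dt1[, `:=`(V1 = as.numeric(V1), V2 = as.numeric(V2))]
dt2[, `:=`(V1 = as.numeric(V1))]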

code

for all matches on intervals:

dt1[ dt2, .(x.V1, x.V2, x.V3, x.V4, i.V1), on = .(V1<=V1, V2>=V1, V3=V2, V4 = V3), allow.cartesian = TRUE][]

output

#     x.V1 x.V2 x.V3 x.V4 i.V1
#  1:    3    5 blue    -    4
#  2:    7   10 blue    -    8
#  3:   11   15 blue    -   12
#  4:   15   20 blue    -   16
#  5:   15   20 blue    -   20
#  6:   19   25 blue    -   20
#  7:   23   30  red    -   24
#  8:   23   30  red    -   28
#  9:   27   35  red    -   28
# 10:   27   35  red    -   32
# 11:   31   40  red    -   32
# 12:   31   40  red    -   36
# 13:   35   45  red    -   36
# 14:   31   40  red    -   40
# 15:   35   45  red    -   40
# 16:   39   50  red    -   40
# 17:    3    5 blue    +    4
# 18:    7   10 blue    +    8
# 19:   11   15 blue    +   12
# 20:   15   20 blue    +   16
# 21:   15   20 blue    +   20
# 22:   19   25 blue    +   20
# 23:   23   30  red    +   24
# 24:   23   30  red    +   28
# 25:   27   35  red    +   28
# 26:   27   35  red    +   32
# 27:   31   40  red    +   32
# 28:   31   40  red    +   36
# 29:   35   45  red    +   36
# 30:   31   40  red    +   40
# 31:   35   45  red    +   40
# 32:   39   50  red    +   40
#     x.V1 x.V2 x.V3 x.V4 i.V1
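The question also asks for the "opposite" layout, with the position first and the matched interval appended. A minimal sketch of that, assuming the same non-equi join and only reordering and renaming the columns in `j`, could look like the block below. The output names `position`, `color`, `direction`, `start` and `end` are illustrative choices, and `nomatch = NULL` is added on the assumption that positions falling in no interval should be dropped.

```r
library(data.table)

# Sample data from the question (character matrices built with cbind()).
interv <- rbind(
  cbind(seq(3, 40, by = 4), seq(5, 50, by = 5), rep(c("blue", "red"), each = 5), "+"),
  cbind(seq(3, 40, by = 4), seq(5, 50, by = 5), rep(c("blue", "red"), each = 5), "-")
)
to_match <- cbind(rep(seq(4, 43, by = 4), 2),
                  rep(rep(c("blue", "red"), each = 5), 2),
                  rep(c("-", "+"), each = 10))

# Convert to data.tables and make the interval bounds and positions numeric.
dt1 <- as.data.table(interv)
dt2 <- as.data.table(to_match)
dt1[, `:=`(V1 = as.numeric(V1), V2 = as.numeric(V2))]
dt2[, V1 := as.numeric(V1)]

# Same join conditions as above; j puts the to_match columns first.
res <- dt1[dt2,
           .(position = i.V1, color = i.V2, direction = i.V3,
             start = x.V1, end = x.V2),
           on = .(V1 <= V1, V2 >= V1, V3 = V2, V4 = V3),
           nomatch = NULL,          # drop positions that fall in no interval
           allow.cartesian = TRUE]
print(res)
```

A position that lies in several intervals (for example 20, which is in both 15-20 and 19-25) still appears once per matching interval.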
Wimpel