
Take the following generic data

A <- c(5,7,11,10,23,30,24,6)
B <- c(1,2,3,1,2,3,1,2)
C <- data.frame(A,B)

and the following intervals

library(intervals)
interval1 <- Intervals(
  matrix(
    c(
      5, 15,
      15, 25,
      25, 35,
      35, 100
    ),
    ncol = 2, byrow = TRUE
  ),
  closed = c( TRUE, FALSE ),
  type = "Z"
)
rownames(interval1) <- c("A","B","C", "D")

interval2 <- Intervals(
  matrix(
    c(
      0, 10,
      12, 20,
      22, 30,
      30, 100
    ),
    ncol = 2, byrow = TRUE
  ),
  closed = c( TRUE, FALSE ),
  type = "Z"
)
rownames(interval2) <- c("P","Q","R", "S")

Now I want to create the following output table

[Image: desired output table]

So where an A value falls within both sets of intervals, I want to 'copy' all of its data to a line below. This also introduces data$X, which is the interval1 label, and data$Y, which is the interval2 label. Where a value does not fit within any of the intervals, I want to remove it from the data.frame.

I am not sure whether the break() function would be a better way to create the intervals, or whether dplyr can be used to create the repeated data rows.

lukeg
  • I do not understand. Sorry, but your explanation is not clear enough. Can you elaborate on how you obtain the first 4 lines of your final `data.frame`? – Colonel Beauvel May 18 '15 at 12:00
  • I hope this clarifies it: the value 5 falls in interval1 as 'A' and in interval2 as 'P'; the value 7 likewise falls in 'A' and 'P'; the value 11 falls in interval1 as 'A' but is not within any interval2 bounds. – lukeg May 18 '15 at 12:04
  • Thanks! Do you absolutely want to use the intervals package? – Colonel Beauvel May 18 '15 at 12:25
  • No, I am open to the most efficient solution. – lukeg May 18 '15 at 12:26

1 Answer


You can use foverlaps in data.table:

library(data.table)
C.DT <- data.table(C)
C.DT[, A1:=A] # required for `foverlaps` so we can do a range search

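# `D` and `E` are your interval bounds as plain matrices (same values as interval1 and interval2)
D <- matrix(c(5, 15, 15, 25, 25, 35, 35, 100), ncol = 2, byrow = TRUE)
E <- matrix(c(0, 10, 12, 20, 22, 30, 30, 100), ncol = 2, byrow = TRUE)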

I1 <- data.table(cbind(data.frame(D), idX=LETTERS[1:4], idY=NA))
I2 <- data.table(cbind(data.frame(E), idX=NA, idY=LETTERS[16:19]))

setkey(I1, X1, X2)  # set the keys on our interval ranges
setkey(I2, X1, X2)

rbind(
  foverlaps(C.DT, I1, by.x=c("A", "A1"), nomatch=0), # match every value in `C.DT$A` to the ranges in `I1` 
  foverlaps(C.DT, I2, by.x=c("A", "A1"), nomatch=0)
)[order(A, B), .(A, B, X=idX, Y=idY)]

Produces:

     A B  X  Y
 1:  5 1  A NA
 2:  5 1 NA  P
 3:  6 2  A NA
 4:  6 2 NA  P
 5:  7 2  A NA
 6:  7 2 NA  P
 7: 10 1  A NA
 8: 10 1 NA  P
 9: 11 3  A NA
10: 23 2  B NA
11: 23 2 NA  R
12: 24 1  B NA
13: 24 1 NA  R
14: 30 3  C NA
15: 30 3 NA  R
16: 30 3 NA  S
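
One thing to watch in the output above: `foverlaps` treats both ends of an interval as inclusive, so 10 is matched to P and 30 to both R and S, whereas the question declares the intervals as closed on the left only. A minimal sketch of one possible workaround, assuming the data are whole numbers (as `type = "Z"` in the question suggests), is to shrink each upper bound by 1 before the join. The names `E_half`, `I2_half`, and `pts` are made up for this illustration.

```r
library(data.table)

# Interval bounds from the question's interval2, as a plain matrix
E <- matrix(c(0, 10, 12, 20, 22, 30, 30, 100), ncol = 2, byrow = TRUE)

# Assumption: integer-valued data, so [lo, hi) covers the same values as [lo, hi - 1]
E_half <- E
E_half[, 2] <- E_half[, 2] - 1

I2_half <- data.table(cbind(data.frame(E_half), idY = LETTERS[16:19]))
setkey(I2_half, X1, X2)

pts <- data.table(A = c(5, 7, 11, 10, 23, 30, 24, 6))
pts[, A1 := A]  # zero-width range [A, A] so each value can be searched as an interval

res <- foverlaps(pts, I2_half, by.x = c("A", "A1"), nomatch = 0L)[order(A), .(A, Y = idY)]
res
```

With the adjusted bounds, 10 no longer matches P and 30 matches only S. For non-integer data this subtraction trick would not be appropriate, and a filter on the upper bound after the join would be needed instead.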

Note that you can easily change what appears in place of NA by modifying the steps where I1 and I2 are created.
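
As a small sketch of that kind of change, the unused id column can be filled with a placeholder string rather than `NA` when the lookup table is built. The label `"none"` is only an assumed value for illustration.

```r
library(data.table)

# Interval bounds from the question's interval1, as a plain matrix
D <- matrix(c(5, 15, 15, 25, 25, 35, 35, 100), ncol = 2, byrow = TRUE)

# Same construction as I1 in the answer, but with a string instead of NA
# in the id column that this table does not supply
I1 <- data.table(cbind(data.frame(D), idX = LETTERS[1:4], idY = "none"))
I1
```

The same substitution would apply to `idX` when building `I2`.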

BrodieG
  • Thanks, that works great. Can you please explain the setkey() function? – lukeg May 18 '15 at 13:31
  • @lukeg That is a `data.table` function that orders your table by the columns selected, which then allows `data.table` to search through those columns knowing they are ordered (this allows for fast searches). – BrodieG May 18 '15 at 13:38
  • @lukeg, you should post a new question that captures the additional complexity while trying to keep the problem as simple as possible. – BrodieG May 18 '15 at 17:49
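
To illustrate the `setkey()` behaviour described in the comments above, here is a minimal sketch on a toy table. The table `DT` and its values are hypothetical and chosen only so that the reordering is visible.

```r
library(data.table)

# Hypothetical table of interval bounds, deliberately out of order
DT <- data.table(X1 = c(25, 5, 35, 15), X2 = c(35, 15, 100, 25))

setkey(DT, X1, X2)  # sorts DT by X1, then X2, in place and records the key

key(DT)  # the key columns now attached to DT
DT       # rows now appear in ascending order of X1
```

Because the table is modified in place, no assignment is needed after `setkey()`.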