0

I've tried to code an answer to this based on similar questions, however I'm struggling to get any answers working exactly for my use case.

I have 2 genetic datasets (looking at positions of mutations in the genome).

These look like:

#df1:
Chromosome  Min     Max   Group
1           500     1000    1
1           400     1900    2
2           300     400     3
3           2000    2100    4
#df2
Gene    Chromosome  Position
Gene1          1    600
Gene1          1    650 
Gene2          1    1700
Gene3          2    350
Gene4          2    355
Gene5          2    450
Gene6          3    2050

I am looking to find which rows in df2 belong to which Group in df1 - so I am coding to ask if a row in df2 has a matching Chromosome number AND the Position column value is inside the range between the Max and Min columns of df1, then if it is assign that row the same Group number as is in df1. I want this to also duplicate rows in df2 if they appear in multiple Groups.

Expected output of the example is:

Gene    Chromosome  Position   Group
Gene1          1    600          1
Gene1          1    650          1 
Gene1          1    600          2
Gene1          1    650          2 
Gene2          1    1700         2
Gene3          2    350          3
Gene4          2    355          3 
Gene5          2    450          NA
Gene6          3    2050         4

#Gene1 enters both groups 1 and 2 as their Chromosome and Position fits in both those groups of df1

They key thing I'm trying to get that I am missing is that some groups in df1 have overlap in their min-max range, but I still want each individual group to be kept and just Genes/rows of df2 to be duplicated as their position might match mulitple groups with overlap.

Currently I am trying to code this with:

df1$ID <- seq.int(nrow(df1))
df2$ID<- seq.int(nrow(df2))

df2[df1, Group := i.ID, on = .(Chromosome, Position > Min, Position < Max ) ]

For my example this outputs:

Gene  Chromosome Position  ID Group
Gene1   1           600     1   2
Gene1   1           650     2   2
Gene2   1           1700    3   2
Gene3   2           350     4   3
Gene4   2           355     5   3
Gene5   2           450     6   NA
Gene6   3           2050    7   4

So in this case because Groups 1 and 2 have an overlap in their min-max range, Group 1 has been lost from the output result. Is there another way I can code to avoid this and match rows but maintain all Groups despite any range overlaps?

I have tried other ways to code this with a similar question (How to perform join over date ranges using data.table?) using foverlaps() but this also does not work as expected.

Input data:

df1 <-
structure(list(Chromosome = c(1L, 1L, 2L, 3L), Min = c(500L, 
400L, 300L, 2000L), Max = c(1000L, 1900L, 400L, 2100L), Group = 1:4, 
    ID = 1:4), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

df2 <-
structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene4", 
"Gene5", "Gene6"), Chromosome = c(1L, 1L, 1L, 2L, 2L, 2L, 3L), 
    Position = c(600L, 650L, 1700L, 350L, 355L, 450L, 2050L)), row.names = c(NA, 
-7L), class = c("data.table", "data.frame"))

r2evans
  • 141,215
  • 6
  • 77
  • 149
DN1
  • 234
  • 1
  • 13
  • 38

2 Answers2

0

Are you looking for something like this?

setDT(df1)
setDT(df2)

df2[, Group := df1[.SD, 
                   on = .(Chromosome = Chromosome, Max > Position, Min < Position), 
                   toString(Group), 
                   by = .EACHI]$V1]

#     Gene Chromosome Position Group
# 1: Gene1          1      600  1, 2
# 2: Gene1          1      650  1, 2
# 3: Gene2          1     1700     2
# 4: Gene3          2      350     3
# 5: Gene4          2      355     3
# 6: Gene5          2      450    NA
# 7: Gene6          3     2050     4
s_baldur
  • 29,441
  • 4
  • 36
  • 69
0

The data.table merge notation of A[B,...] performs a left join of B on A ... backwards to the eye. Its equivalents:

  • base R: merge(A, B, all.x = FALSE, all.y = TRUE, ...)
  • dplyr: dplyr::right_join(A, B, ...) or dplyr::left_join(B, A, ...)
  • SQL: ... FROM B left join A ...

With that, I suggest two steps:

  1. Because the range-join causes one of your non-equi joining columns to be renamed, I'll save Position into a new variable and use that instead. It'll wash out.

  2. Reverse the objects.

Therefore,

df1[df2[,Pos0:=Position], on = .(Chromosome, Min < Pos0, Max > Pos0 ) ][,.(Gene,Chromosome,Position,Group)]
#      Gene Chromosome Position Group
#    <char>      <int>    <int> <int>
# 1:  Gene1          1      600     1
# 2:  Gene1          1      600     2
# 3:  Gene1          1      650     1
# 4:  Gene1          1      650     2
# 5:  Gene2          1     1700     2
# 6:  Gene3          2      350     3
# 7:  Gene4          2      355     3
# 8:  Gene5          2      450    NA
# 9:  Gene6          3     2050     4
r2evans
  • 141,215
  • 6
  • 77
  • 149