How to group rows in a range and consider a 3rd column?

Question

I have a genetic dataset where I want to group genetic variants/rows that are physically close together in the genome. I want to group genes that are within ranges from certain spots in the genome per chromosome (chrom).

My 'spots' dataset is of positions that variants/rows need to be within a range of and looks like:

 chrom      low       high
   1        500       1700
   1        19500     20600
   5        400       1500

My low and high columns are the ranges that I want to see if any rows in my next dataset fall into, with also accounting that the chromosome (chrom) must also match. Each row with a unique range and chrom combination is its own group for which I am looking to see if anything in my other dataset falls into.

My other dataset has a position value that I'm looking to see if fits in any of the ranges above with matching chrom, in order to label it as corresponding to that range, and then I can group positions in the same range and chrom together:

Gene   chrom position 
Gene1   1    1200          
Gene2   1    10000        
Gene3   5    500 
Gene4   5    560
Gene5   1    20100

I've tried using group_by() and between() to set up the range, since seeing other questions that are similar for dates/times ranges, but I'm struggling to account for the need to match the chromosome (chrom) between the datasets before then searching for range.

Output would look like:

Gene   chrom position   Group 
Gene1   1    1200          1  #position is in one of the ranges and matches the chrom so is in a group    
Gene2   1    10000        NA  #does not fit into any range on chrom 2 (no matches)
Gene3   5    500           2  #position is in one of the ranges and matches the chrom so is in a group
Gene4   5    560           2  #position is in the same range and chrom as above so joins that group
Gene5   1    20100         3  #position matches a chrom and range and so gets a group corresponding to that particular chrom and range

Gene3 and Gene4 are not in group 1 because they are on a different chrom, but they do match the chrom and are within range of of the 3rd line of my first dataset - so they get to be in the group that corresponds to that range and chrom.
Gene5 is not in the same group as Gene1 as whilst they match chrom they are in different ranges of low and high, so get their own groups for the unique ranges.

So I am creating a Group column with a shared number for all rows in the same range between low and high on the same chrom, or NA if their position doesn't match in any range and chrom in the first dataset.

Input data:

df1 <- 
structure(list(chrom = c(1L, 1L, 5L), 
   low = c(500L, 19500L, 400L), high = c(1700L, 20600L, 1500L
    )), row.names = c(NA, -3L), class = c("data.table", "data.frame"))

df2 <- 
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"
), chrom = c(1L, 1L, 5L, 5L, 1L), position = c(1200L, 10000L, 
500L, 560L, 20100L)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

I'm also looking into giving my first dataset unique identifiers per each unique range and chrom combination and then assign that identifier to any row in dataset 2 that matches the combination too, so that identifier creates my group numbers column. Although my real data is 2.3k rows of ranges and 82k rows to match into shared groups so I'm also having problems running dplyr options I would normally try.

This post might help - https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions — Ronak Shah, Nov 07 '20 at 12:20
I've updated your question to match the provided data set to what you are actually showing. regarding the question itself, I would just do `df2[, grp := df1[.SD, which = TRUE, on = .(chrom, low <= position, high >= position)]]` (If you don't really care about the groups order) — David Arenburg, Nov 11 '20 at 21:20
How is this not a duplicate of post linked by Ronak: https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions — zx8754, Nov 12 '20 at 22:47
I am trying to also group matches and trying to consider code that will run on my larger real data. The overlaps option doesn't work for my real data I either get errors or it runs indefinitely for me. — DN1, Nov 13 '20 at 11:05

Onyambu · Answer 1 · 2020-11-11T21:48:24.900

If you know sql then this can quickly be solved in sql + R:

df1$group_id <- seq(nrow(df1)) #This creates the unique groups for each interval

sqldf::sqldf('
    SELECT df2.*, df1.group_id 
    FROM df2 
    LEFT JOIN df1 
    ON df2.chrom = df1.chrom AND position between low AND high')

  Gene chrom position group_id
1 Gene1     1     1200        1
2 Gene2     1    10000       NA
3 Gene3     5      500        3
4 Gene4     5      560        3
5 Gene5     1    20100        2

denis · Accepted Answer · 2020-11-16T09:23:46.567

You could use non equi join in data.table:

library(data.table)
df1 <- setDT(df1)
df2 <- setDT(df2)

df1[,group := 1:.N]
df1[df2,on = .(chrom, low < position, high > position)]


   chrom   low  high group  Gene
1:     1  1200  1200     1 Gene1
2:     1 10000 10000    NA Gene2
3:     5   500   500     3 Gene3
4:     5   560   560     3 Gene4
5:     1 20100 20100     2 Gene5

Here I first set a group for each line of df1. After the merge, the line is associated to a group if the condition is met.

Non equi merge are not super intuitive, but super powerfull, and explicit: the merging condition .(chrom, low < position, high > position) is letterally what you explicited (you want same chromosome, and position between low and high).

In data.table, when you do

df1[df2,on = something]

you subset df1 with the lines of df2 meeting the condition expressed by on. If something is just a common variable of df1 and df2, then it is equivalent to

merge(df1,df2,all.y = T,by = "someting")

But something can be a list of variable and conditions between the variables of your two data.tables. Here, .() indicates a list, and .(chrom,low < position, high > position) indicate you merge on the variable chrom (identical between the two data.tables), and low < position, and high > position. When you express inequality, you must start with the variable from the main data.table (df1 here), then the variables of the subsetting data.table (df2).

The output of this non equi merge using inequalities replace the variable expressed in inequalities of the main data.table (i.e. df1) by the variables of the subsetting data.table (i.e. df2 here), and so low and high become position. If you want to keep the low and high values, you should copy them in an other variable, or merge on a copy of these variables.

You can actually do the opposite merge, wew you subset df2 by df1 entries, with the same condition:

df2[df1,on = .(chrom,position >low , position<high)]

    Gene chrom position position.1 group
1: Gene1     1      500       1700     1
2: Gene5     1    19500      20600     2
3: Gene3     5      400       1500     3
4: Gene4     5      400       1500     3

Here you subset df1 with the entries of df2 meeting the conditions expressed in on = .(), and obtain the list of Gene that actually belong to a group (Gene2 is not here because it does not match the subset).

Similarly to what has been explained above, here position become low and high

Edit

I just saw @DavidArenburg 's comment, and it is a more condensed and better version of what I proposed and explained:

df2[, grp := df1[.SD, which = TRUE, on = .(chrom, low <= position, high >= position)]]

directly associate the result of the non equi merge df1[df2,on = .(chrom, low < position, high > position)] to the group variable, using which = TRUE, which gives you the line of df2 which meet the merge condition of df1[df2 , on =....].

score 2 · Answer 3 · answered Nov 09 '20 at 09:35

As pointed out in the comments, you can just need to use findOverlaps from GenomicRanges to find the rows in your reference dataframe that encompass your rows in the second data.frame

Your df2 is a bit different from that showed in the example, so I change it to match:

df2 = structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"
), chrom = c(1L, 1L, 5L, 5L, 1L), position = c(1200L, 10000L, 
500L, 560L, 20100L)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

And your df1 has a different order:

  chrom   min   max   low  high
1     1  1000  1200   500  1700
2     1 20000 20100 19500 20600
3     5   900  1000   400  1500

We can make a GenomicRanges object like below:

library(GenomicRanges)
gr1 = makeGRangesFromDataFrame(df1,start.field="low",end.field="high")
gr1$Group = 1:length(gr1)

        GRanges object with 3 ranges and 1 metadata column:
      seqnames      ranges strand |     Group
         <Rle>   <IRanges>  <Rle> | <integer>
  [1]        1    500-1700      * |         1
  [2]        1 19500-20600      * |         2
  [3]        5    400-1500      * |         3

Then do likewise for the second dataframe and find the overlap:

gr2 = makeGRangesFromDataFrame(df2,start.field="position",end.field="position")
ovlp = as.data.frame(findOverlaps(gr2,gr1))
df2$Group = ovlp$subjectHits[match(1:length(gr2),ovlp$queryHits)]

  Gene chrom position Group
1 Gene1     1     1200     1
2 Gene2     1    10000    NA
3 Gene3     5      500     3
4 Gene4     1      560     1
5 Gene5     1    20100     2

ekoam · Answer 4 · 2020-11-09T09:58:04.690

Here is a data.table solution. We can use the foverlaps function introduced in that post cited by Ronak.

library(data.table)

setDT(df1, key = c("chrom", "low", "high"))[
  , c("min", "max", "Group") := .(NULL, NULL, .I)
]
setDT(df2)[, position2 := position]
res <- foverlaps(
  df2, df1, 
  by.x = c("chrom", "position", "position2"), 
  type = "within"
)[
  , .(Gene, chrom, position, Group)
]

Output

> res
    Gene chrom position Group
1: Gene1     1     1200     1
2: Gene2     1    10000    NA
3: Gene3     5      500     3
4: Gene4     1      560     1
5: Gene5     1    20100     2

How to group rows in a range and consider a 3rd column?

4 Answers4

Edit

Linked