I have a genetic dataset where I want to group genetic variants/rows that are physically close together in the genome. I want to group genes that are within ranges from certain spots in the genome per chromosome (chrom).
My 'spots' dataset holds the ranges that variants/rows need to fall within, and looks like:
chrom low high
1 500 1700
1 19500 20600
5 400 1500
My low and high columns are the ranges that I want to check the rows of my next dataset against, with the added condition that the chromosome (chrom) must also match. Each row, i.e. each unique combination of range and chrom, is its own group, and I want to see whether anything in my other dataset falls into it.
My other dataset has a position value that I want to check against the ranges above, with matching chrom, so that I can label each row with the range it falls in and then group positions in the same range and chrom together:
Gene chrom position
Gene1 1 1200
Gene2 1 10000
Gene3 5 500
Gene4 5 560
Gene5 1 20100
I've tried using group_by() and between() to set up the range, after seeing similar questions about date/time ranges, but I'm struggling to account for the need to match the chromosome (chrom) between the datasets before then searching for the range.
Output would look like:
Gene chrom position Group
Gene1 1 1200 1 #position is in one of the ranges and matches the chrom so is in a group
Gene2 1 10000 NA #does not fit into any range on chrom 1 (no matches)
Gene3 5 500 2 #position is in one of the ranges and matches the chrom so is in a group
Gene4 5 560 2 #position is in the same range and chrom as above so joins that group
Gene5 1 20100 3 #position matches a chrom and range and so gets a group corresponding to that particular chrom and range
- Gene3 and Gene4 are not in group 1 because they are on a different chrom, but they do match the chrom and are within the range of the 3rd line of my first dataset, so they get to be in the group that corresponds to that range and chrom.
- Gene5 is not in the same group as Gene1 because, whilst they match on chrom, they are in different ranges of low and high, so they get their own groups for the unique ranges.
So I am creating a Group column with a shared number for all rows in the same range between low and high on the same chrom, or NA if their position doesn't match any range and chrom in the first dataset.
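To make the rule concrete, here is a minimal base R sketch of it. It rebuilds small copies of the two tables under the hypothetical names `spots` and `genes`, and it assumes that both `low` and `high` are inclusive bounds and that a position falling in several ranges takes the first one. Groups are numbered in order of first appearance, which is what the expected output above seems to do.

```r
# Toy copies of the two tables (names `spots` and `genes` are illustrative)
spots <- data.frame(chrom = c(1L, 1L, 5L),
                    low   = c(500L, 19500L, 400L),
                    high  = c(1700L, 20600L, 1500L))
genes <- data.frame(Gene     = paste0("Gene", 1:5),
                    chrom    = c(1L, 1L, 5L, 5L, 1L),
                    position = c(1200L, 10000L, 500L, 560L, 20100L))

# For each gene, find the first row of `spots` on the same chrom whose
# [low, high] interval contains the position (NA when there is none).
# Assumption: bounds are inclusive on both ends.
spot_row <- vapply(seq_len(nrow(genes)), function(i) {
  hit <- which(spots$chrom == genes$chrom[i] &
               spots$low   <= genes$position[i] &
               spots$high  >= genes$position[i])
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1))

# Number the groups in order of first appearance; unmatched rows stay NA
genes$Group <- match(spot_row, unique(spot_row[!is.na(spot_row)]))
genes
```

This loops over every row, so it is meant only to pin down the logic, not to be fast on the full data.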
Input data:
df1 <-
structure(list(chrom = c(1L, 1L, 5L),
low = c(500L, 19500L, 400L), high = c(1700L, 20600L, 1500L
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
df2 <-
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"
), chrom = c(1L, 1L, 5L, 5L, 1L), position = c(1200L, 10000L,
500L, 560L, 20100L)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
I'm also looking into giving my first dataset a unique identifier for each unique range and chrom combination, and then assigning that identifier to any row in dataset 2 that matches the combination, so that the identifier becomes my group number column. However, my real data is 2.3k rows of ranges and 82k rows to match into shared groups, so I'm also having problems running the dplyr options I would normally try.
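The identifier idea could be sketched with a data.table non-equi update join, which is one possible approach and should cope with data of this size. The sketch rebuilds the tables with `data.table()` rather than the `structure()` output above, since objects recreated that way may trigger a self-reference warning on `:=`. The column name `range_id` is hypothetical, bounds are assumed inclusive, and if ranges on the same chrom overlap, a position keeps the last matching range.

```r
library(data.table)

df1 <- data.table(chrom = c(1L, 1L, 5L),
                  low   = c(500L, 19500L, 400L),
                  high  = c(1700L, 20600L, 1500L))
df2 <- data.table(Gene     = paste0("Gene", 1:5),
                  chrom    = c(1L, 1L, 5L, 5L, 1L),
                  position = c(1200L, 10000L, 500L, 560L, 20100L))

# One identifier per range/chrom row (hypothetical column name)
df1[, range_id := .I]

# Non-equi update join: copy range_id onto every df2 row whose chrom matches
# and whose position lies in [low, high]; unmatched rows get NA
df2[df1, on = .(chrom, position >= low, position <= high),
    range_id := i.range_id]

# Optional: renumber in order of first appearance to mirror the expected output
df2[, Group := match(range_id, unique(range_id[!is.na(range_id)]))]
df2
```

If the numbering itself does not matter, `range_id` can serve directly as the group column and the last step can be dropped.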