How to match data based on a range of values

Question

I have two genetic datasets. I want to check whether a variant at a given position in the genome (file1) falls within the range of any row in another dataset (file2), and then extract the matching rows from file2 to merge with file1. The one condition is that a variant should only be matched against ranges on the same chromosome. For example:

File1:

Chromosome    Position
1              3
1              47
2              10
3              2

File2:

Chromosome    Start    End
1              101      102
1              40       50  
2              40       50
3              20       22

Expected output:

Chromosome    Start    End
1              40       50 
# this is the only row whose range contains a file1 variant on the same chromosome

Ideally, I would merge the file1 variant into the same row as its matched chromosome, start and end position from file2, but I am new to R and stuck on the first step: matching a variant based on whether its position falls within a range in the second file. Currently I am trying to adapt:

dt1[ dt2, match := i.,ID  #including a made-up ID column for the sake of trying to adapt this code 
     on = .(Chromosome, Position > Start, Position < End ) ]

However, this doesn't seem to work, and beyond this I don't know how else to start. Any help on how to approach this would be appreciated.

Data:

dput(file1)
structure(list(Chromosome = c(1L, 1L, 2L, 3L), Position = c(3L, 
47L, 10L, 2L)), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

dput(file2)
structure(list(Chromosome = c(1L, 1L, 2L, 3L), Start = c(101L, 
40L, 40L, 20L), End = c(102L, 50L, 50L, 22L)), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"))

`bedtools intersect` not to your liking? In R you could use `findOverlaps` from the GenomicRanges Bioconductor package. — Konrad Rudolph, Feb 19 '20 at 15:30
Thank you for this, I was not aware of these, I will look into them both. — DN1, Feb 19 '20 at 16:13

Jonathan V. Solórzano · Accepted Answer · 2020-02-19T16:07:06.703

2

You could use the tidyverse to join the two files by chromosome and then keep only the rows where the Position value lies between Start and End.

library(tidyverse)

df <- file1 %>%
  # Join by Chromosome; each Position is paired with every Start/End range on that chromosome
  left_join(file2, 
            by = "Chromosome") %>% 
  # Flag whether Position lies within the Start-End range (inclusive)
  mutate(isRange = Position >= Start & Position <= End) %>%
  # Keep only the rows where the flag is TRUE
  filter(isRange)
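
For comparison, the question was attempting a data.table non-equi join, and the same match can probably be expressed that way too. The following is only a sketch on the sample data from the question; the result name `matched` is made up for illustration, and `nomatch = NULL` assumes a reasonably recent data.table version. In `on`, the left-hand side of each condition is a column of file2 and the right-hand side is a column of file1.

```r
library(data.table)

# Sample data copied from the question
file1 <- data.table(Chromosome = c(1L, 1L, 2L, 3L),
                    Position   = c(3L, 47L, 10L, 2L))
file2 <- data.table(Chromosome = c(1L, 1L, 2L, 3L),
                    Start      = c(101L, 40L, 40L, 20L),
                    End        = c(102L, 50L, 50L, 22L))

# Non-equi join: same Chromosome, and Start <= Position <= End.
# nomatch = NULL drops file1 variants that fall in no range.
# The x. and i. prefixes pick the original file2 and file1 columns.
matched <- file2[file1,
                 on = .(Chromosome, Start <= Position, End >= Position),
                 nomatch = NULL,
                 .(Chromosome, Position = i.Position,
                   Start = x.Start, End = x.End)]

print(matched)
```

On the sample data this should leave a single row (chromosome 1, position 47, range 40 to 50), with the variant and its matched range side by side.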

edited Feb 19 '20 at 16:07

answered Feb 19 '20 at 15:42

Why make `isRange` a numeric value instead of a logical? – Konrad Rudolph Feb 19 '20 at 15:48
You could as well code it as `F`, `T` instead of `0`, `1`. It will obtain the same result. – Jonathan V. Solórzano Feb 19 '20 at 15:53

Three points: (1) the *result* is the same but the solution using logical values is objectively better because it’s more direct. Your solution does the same, it just adds an additional, redundant indirection. (2) In particular, when using logicals, there’s no need to use `ifelse`, nor to use logical constant literals: just write `mutate(isRange = Position >= Start & Position <= End)`, and `filter(isRange)`; (3) Don’t use `T` and `F` instead of `TRUE` and `FALSE`. They’re shorter but they are *variables* that can be overwritten. – Konrad Rudolph Feb 19 '20 at 16:01
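
Point (3) can be illustrated with a small sketch in base R. The vector `x` and the result names are made up for illustration; the only claim is that `T` is an ordinary variable that can be reassigned, while `TRUE` is a reserved word.

```r
# T starts out bound to TRUE, but nothing stops it being reassigned
T <- 0

x <- c(5, 15, 25)

# With the overwritten T, the condition is FALSE everywhere
with_T    <- x[x > 10 & T]
# TRUE cannot be reassigned, so this behaves as intended
with_TRUE <- x[x > 10 & TRUE]

t_is_true <- identical(T, TRUE)

# Remove the user-defined T so the base binding is visible again
rm(T)
```

Here `with_T` ends up empty while `with_TRUE` keeps 15 and 25, which is the kind of silent bug the comment warns about.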

Already edited the post to include your suggestions, as it results in a shorter and more direct solution. – Jonathan V. Solórzano Feb 19 '20 at 16:09

We could also drop *mutate* step just put the condition within *filter*: `filter(Position >= Start & Position <= End)` – zx8754 Feb 21 '20 at 12:46
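
A minimal sketch of that shorter variant, run on the sample data from the question (the result name `matches` is made up for illustration). Recent dplyr versions may warn about a many-to-many relationship in the join, because chromosome 1 appears twice in both files; that pairing is intended here.

```r
library(dplyr)

# Sample data copied from the question
file1 <- data.frame(Chromosome = c(1L, 1L, 2L, 3L),
                    Position   = c(3L, 47L, 10L, 2L))
file2 <- data.frame(Chromosome = c(1L, 1L, 2L, 3L),
                    Start      = c(101L, 40L, 40L, 20L),
                    End        = c(102L, 50L, 50L, 22L))

matches <- file1 %>%
  # Pair every variant with every range on the same chromosome
  left_join(file2, by = "Chromosome") %>%
  # Keep only the pairs where the variant lies inside the range
  filter(Position >= Start & Position <= End)

print(matches)
```

On the sample data this should leave one row: chromosome 1, position 47, start 40, end 50.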
