I have a genetic dataset in which I want to match chromosome positions from one file against the chromosome position ranges given in another file, keeping a position only if it falls within a range.
I have tried the answers to similar questions, mostly about time intervals, but they haven't worked because I also need the chromosome number to match (so that I don't match identical positions on different chromosomes).
My data looks like this:
#df1 - chromosome positions to find within df2 ranges:
Chromosome Position Start End
1 101 101 101
2 101 101 101
3 600 600 600
#df2 - genomic ranges
Chromosome Start End CpG
1 50 200 10
1 300 400 2
4 100 200 5
Expected matched output (ultimately I want the matching CpG column for the df1 data):
Chromosome Position Start End CpG
1 101 50 200 10 #only row of df1 that's within a range on df2 on the same chromosome
I am currently trying to do this with:
df <- df1 %>%
  left_join(df2, by = "Chromosome") %>%
  filter(Position >= Start & Position <= End)
Error: Problem with `filter()` input `..1`.
x object 'Start' not found
i Input `..1` is `Position >= Start & Position <= End`.
I don't understand why I am getting this error: the Start and End columns exist in both files and are all of integer class. Is there something I'm missing, or is there another way I can solve this?
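One likely cause, offered as a guess rather than a confirmed diagnosis: Start and End exist in both data frames but are not part of `by`, so the join disambiguates them with dplyr's default suffixes (`Start.x`, `End.x`, `Start.y`, `End.y`) and no column named plain `Start` exists when `filter()` runs. Below is a minimal sketch on the sample data above, assuming only Chromosome and Position are needed from df1. The names `positions` and `matched` are introduced here purely for illustration.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(Chromosome = c(1L, 2L, 3L),
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))
df2 <- data.frame(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# Drop df1's own Start/End so that Start and End in the joined
# table refer unambiguously to the df2 ranges
positions <- df1 %>% select(Chromosome, Position)

matched <- positions %>%
  inner_join(df2, by = "Chromosome") %>%           # pair each position with every range on its chromosome
  filter(Position >= Start & Position <= End)      # keep only positions inside a range

print(matched)
```

Note that joining on Chromosome alone builds every position/range pair per chromosome before filtering, which may be too memory-hungry for large data.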
My actual data is quite large, so I am also looking for a data.table solution if one works for this. I've tried, but I'm a novice and haven't got far:
df1[df2, on = .(Chromosome, Position > End, Position < Start ) ]
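For comparison, here is a sketch of how a data.table non-equi join for this condition might be written. It assumes both tables are data.tables holding the sample data above. The comparisons are written as "range column, operator, position column", with the ranges as the outer table. The `x.` and `i.` prefixes in `j` are used because a non-equi join otherwise reports the joined Start/End columns with the values from the position table. `matched` is an illustrative name.

```r
library(data.table)

# Sample data from the question
df1 <- data.table(Chromosome = c(1L, 2L, 3L),
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))
df2 <- data.table(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# x = df2 (ranges), i = df1 (positions).
# In `on`, the left-hand name is a df2 column and the right-hand name a df1 column:
# same chromosome, range Start <= Position, and range End >= Position.
matched <- df2[df1,
               on = .(Chromosome, Start <= Position, End >= Position),
               nomatch = 0L,                        # drop positions with no matching range
               .(Chromosome,
                 Position = i.Position,             # position from df1
                 Start    = x.Start,                # original range bounds from df2
                 End      = x.End,
                 CpG)]

print(matched)
```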
Edit: trying with foverlaps:
setkey(df1)
df2[, End := Start]
foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", nomatch = 0L)
Error in foverlaps(df2, df1, by.x = names(df2), type = "within", mult = "all", :
length(by.x) != length(by.y). Columns specified in by.x should correspond to columns specified in by.y and should be of same lengths.
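For reference, a sketch of a `foverlaps()` call for this kind of lookup, assuming the sample data above. Here the ranges table (df2) is the keyed one, with the key ending in its Start and End columns, and `by.x` names only the chromosome and interval columns of the positions table; `by.y` is left to default to the key of df2. This is an assumption about the intended setup, not a confirmed fix for the error above. `matched` is an illustrative name.

```r
library(data.table)

# Sample data from the question
df1 <- data.table(Chromosome = c(1L, 2L, 3L),
                  Position   = c(101L, 101L, 600L),
                  Start      = c(101L, 101L, 600L),
                  End        = c(101L, 101L, 600L))
df2 <- data.table(Chromosome = c(1L, 1L, 4L),
                  Start      = c(50L, 300L, 100L),
                  End        = c(200L, 400L, 200L),
                  CpG        = c(10L, 2L, 5L))

# The table being searched (the ranges) must be keyed,
# and the last two key columns must be the interval start and end
setkey(df2, Chromosome, Start, End)

# Each df1 row is a zero-width interval (Start == End == Position);
# type = "within" keeps positions that lie inside a df2 range
matched <- foverlaps(df1, df2,
                     by.x = c("Chromosome", "Start", "End"),
                     type = "within",
                     nomatch = 0L)

print(matched)
```

In the result, Start and End are the range bounds from df2, while the identically named df1 columns come back with an `i.` prefix.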