0

I have two big data with start and end positions and I want to find the common between them according to these positions. But I do not want just find rows that are exactly similar in positions. for example, in the first file, I have a big range from 1 to 300000 and in the second one, I have a range from 100000 to 210000, so the ranges in the second file are inside of the ranges in the first file and I want to consider this as a common or when just one side of the range is covered by another file. here I mentioned an example data:

df1 <- setDT(data.frame(name = c("chr1", "chr2", "chr3", "chr4"), START = c(1, 300000, 470000, 800000), END = c(200000, 370000, 500000, 990000), STRAND = c("+", "+", "+", "-"))) 
df2 <- setDT(data.frame(name = c("chr1", "chr2", "chr3", "chr4"), START = c(55000, 365000, 372000, 750000), END = c(187000, 371000, 469000, 835000), STRAND = c("+", "+", "+", "-"))) 

therefore: chr1 in df2 is covered by chr1 in df1 so it is common, chr2 in df2 is common as well because start position is covered by df1, chr3 is not common because it is not covered and chr4 is common because end position in df2 is covered by start position in df1.

  • I want to find common between two big files like what I mentioned not just by exact similarity.

Could someone who knows help me how can I do this?

Waldi
  • 39,242
  • 6
  • 30
  • 78

2 Answers2

1

You could use foverlaps:

library(data.table)
setDT(df1,df2)
setkey(df2,START,END)
foverlaps(df1,df2,by.x=c('START','END'),
                  by.y=c('START','END'),
                  type='any',
                  nomatch = NULL)

     name  START    END STRAND i.name i.START  i.END i.STRAND
   <char>  <num>  <num> <char> <char>   <num>  <num>   <char>
1:   chr1  55000 187000      +   chr1   1e+00 200000        +
2:   chr2 365000 371000      +   chr2   3e+05 370000        +
3:   chr4 750000 835000      -   chr4   8e+05 990000        -
Waldi
  • 39,242
  • 6
  • 30
  • 78
  • the code is good when I tried with model that I posted but I have a problem, while I used setDT to have my real data in .table format, after running the code I faced the error that the x is not a data.table. Do you know where is wrong? **Error in setkeyv(x, cols, verbose = verbose, physical = physical) : x is not a data.table** – mehdi heidari Jul 17 '22 at 09:10
  • 1
    You could verify `class(real_data)` : is it `data.table`? – Waldi Jul 17 '22 at 09:19
1

Another solution also based on data.table.

cols = c("S", "E")
df1[, (cols) := .(START, END)] # useful to keep START/END columns unaltered while joining.
df2[, (cols) := .(START, END)]

df1[df2, on=.(name, E>=S, S<=E), nomatch=NULL][, (cols) := NULL]

     name  START    END STRAND i.START  i.END i.STRAND
1:   chr1      1 200000      +   55000 187000        +
2:   chr2 300000 370000      +  365000 371000        +
3:   chr4 800000 990000      -  750000 835000        -