I have two CSV files. In the first, each row contains two breakpoint positions, the chromosome each position is on, and the sample the breakpoints come from. In the second, each row contains a start and end position, a sample name, and a chromosome.

Some breakpoint positions fall within the start and end positions listed in the second file. I want to find any breakpoint positions that do not fall within any of those start-end intervals. The chromosome and sample name must match for a position to count as falling within an interval.

I want to check each of these positions (pos1 and pos2)

Example of file with breakpoint positions

        sample chr1 pos1       chr2 pos2
   1    A01-28  1   59679925    1   204187341
   2    A01-28  1   17727050    21  39859974
   3    A01-28  1   40443937    2   179382940
  ...
5720    Z05-65  14  74930698    14  77657362
4999    Z05-65  8   54849551    11  87898249
5000    Z05-65  14  74928588    14  76065367

to see whether any do NOT fall between any of these start and end values

Example of file with start and end positions

        sample chr  start    end
   1    A01-28  1   3218610  6198652
   2    A01-28  1   6198745  8625449
   3    A01-28  1   8630794  9666687
  ...
19491   Z05-65  X   142569607   151391630
19492   Z05-65  X   151393577   151394249
19493   Z05-65  X   151394464   154905589

Again, the chromosome and sample name have to match.

I've read each file into a data frame, but I'm not sure how to proceed. A for loop could take forever, since one file has 5000+ rows and the other has 19000+. I'm not very proficient in R, and I suspect there is a cleverer way of doing this.

  • I haven't done this for years, so cannot give an example. But I would recommend learning how to use the GenomicRanges package from Bioconductor. It has a `findOverlaps` method [for this exact use case](https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html). – neilfws Jun 09 '20 at 02:56
  • data.table also allows findoverlaps – Waldi Jun 09 '20 at 05:25
  • use `data.table::foverlaps()` as a starting point.. you can also use a non-equi join on two data.tables. – Wimpel Jun 09 '20 at 06:19
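
To make the non-equi join suggestion concrete, here is a minimal sketch rather than a tested solution for the real files. The toy values, the table names `bp` and `seg`, and the helper columns `row`, `end_name` and `n_segments` are all made up for illustration; only the column layout follows the question. It also assumes that interval boundaries are inclusive and that chromosome columns may need converting to character, since the second file contains `X`.

```r
library(data.table)

# Toy data with the same column layout as in the question (values are illustrative)
bp <- data.table(
  sample = c("A01-28", "A01-28", "Z05-65"),
  chr1   = c(1L, 1L, 14L),
  pos1   = c(3500000L, 59679925L, 74930698L),
  chr2   = c(1L, 21L, 14L),
  pos2   = c(7000000L, 39859974L, 77657362L)
)
seg <- data.table(
  sample = c("A01-28", "A01-28", "Z05-65"),
  chr    = c("1", "1", "14"),
  start  = c(3218610L, 6198745L, 74000000L),
  end    = c(6198652L, 8625449L, 75000000L)
)

# Join columns must have the same type in both tables
bp[, c("chr1", "chr2") := lapply(.SD, as.character), .SDcols = c("chr1", "chr2")]
seg[, chr := as.character(chr)]

# Reshape to one row per breakpoint position, remembering where it came from
long <- rbind(
  bp[, .(row = .I, end_name = "pos1", sample, chr = chr1, pos = pos1)],
  bp[, .(row = .I, end_name = "pos2", sample, chr = chr2, pos = pos2)]
)

# Non-equi join: for every position, count the segments with the same sample
# and chromosome where start <= pos <= end (by = .EACHI gives one row per position)
counts <- seg[long, on = .(sample, chr, start <= pos, end >= pos), .N, by = .EACHI]
long[, n_segments := counts$N]

# Positions that fall within no segment at all
outside <- long[n_segments == 0L]
print(outside)
```

Reshaping to a long table means `pos1` and `pos2` go through the same join, and the `row` and `end_name` columns let each unmatched position be traced back to its original row. No explicit loop is involved, so this should be comfortable at 5000 by 19000 rows.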
