I want to process several BED files to find overlapping regions. I read each data set into R as a data frame. How can I efficiently scan two data sets in parallel to detect where the overlapping regions occur? My approach is to take each peak region (each row) of one data frame as a query, load the peak regions from all rows of another data frame into an interval tree, and then search for overlaps. I am confused about how to implement this in R, and would appreciate it if someone could point out how to do this.
Here is a simple example of what I want to achieve:
[1] chr1 [10171, 10226] * | MACS_peak_1 7.12
[2] chr1 [32698, 33079] * | MACS_peak_2 13.92
[3] chr1 [34757, 34794] * | MACS_peak_3 6.08
[4] chr1 [37786, 37833] * | MACS_peak_4 2.44
[5] chr1 [38449, 38484] * | MACS_peak_5 3.61
[6] chr1 [38584, 38838] * | MACS_peak_6 4.12
..
..
[] chrX [155191467, 155191508] * | MACS_peak_77948 3.80
[] chrX [155192786, 155192821] * | MACS_peak_77949 3.71
[] chrX [155206352, 155206433] * | MACS_peak_77950 3.81
[] chrX [155238796, 155238831] * | MACS_peak_77951 3.81
[n-1] chrX [155246563, 155246616] * | MACS_peak_77952 2.44
[n] chrX [155258442, 155258491] * | MACS_peak_77953 5.08
# step 1: read the bed files into R (import.bed comes from the rtracklayer package):
library(rtracklayer)
library(GenomicRanges)
bed_1 <- as(import.bed(bedFile_1), "GRanges")
bed_2 <- as(import.bed(bedFile_2), "GRanges")
bed_3 <- as(import.bed(bedFile_3), "GRanges")
# step 2: take the first row of bed_1 as the query (each row is one genomic interval) and build an interval tree from bed_2
peak <- bed_1[1] # take only one row at a time
bed_2.intvl <- GenomicRanges::GIntervalTree(bed_2)
# step 3: find the overlapping regions:
overlap <- GenomicRanges::findOverlaps(peak, bed_2.intvl)
# step 4: go back to the original bed_2 and look up which of its intervals overlap the peak taken from bed_1, and what the significance of each of those bed_2 intervals is.
# step 5: move on to the next interval in bed_1 and repeat the process above
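
From what I can tell, findOverlaps() in GenomicRanges is vectorised over the query, so steps 2 to 5 might collapse into a single call with no explicit loop. Below is a minimal sketch of that idea. It uses toy intervals in place of my real files: the coordinates, names and scores are made up for illustration. It also assumes the significance value lives in a `score` metadata column:

```r
library(GenomicRanges)

# Toy stand-ins for bed_1 and bed_2 (hypothetical values, not from the real files).
bed_1 <- GRanges(
  seqnames = c("chr1", "chr1", "chrX"),
  ranges   = IRanges(start = c(100, 500, 1000), end = c(200, 600, 1100)),
  name     = c("a1", "a2", "a3"),
  score    = c(7.1, 13.9, 6.0)
)
bed_2 <- GRanges(
  seqnames = c("chr1", "chr1", "chrX"),
  ranges   = IRanges(start = c(150, 700, 1050), end = c(250, 800, 1200)),
  name     = c("b1", "b2", "b3"),
  score    = c(2.4, 3.6, 4.1)
)

# Every row of bed_1 is used as a query in one call (stands in for steps 2, 3 and 5).
hits <- findOverlaps(bed_1, bed_2)

# Step 4: pair each bed_1 peak with the bed_2 interval(s) it overlaps,
# carrying along the score of the bed_2 interval.
overlap_table <- data.frame(
  query_name    = bed_1$name[queryHits(hits)],
  subject_name  = bed_2$name[subjectHits(hits)],
  subject_score = bed_2$score[subjectHits(hits)],
  stringsAsFactors = FALSE
)
print(overlap_table)
```

If this is right, the same call could be repeated for bed_1 against bed_3, but I am not sure whether this is the recommended way to compare more than two files.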