I have several hundred genomic regions with start and end positions. I want to extract the gene IDs in each region from a reference file (the reference file contains thousands of genes with their start and end positions). So far I have extracted the region information for each gene; for example, my output file tells me whether a gene is present in region one, two, and so on. However, each region contains many genes, and several genes lie in multiple regions. I want an output file that puts the IDs of all the genes in a region in a single cell next to that region, with the gene IDs separated by commas. If a gene is present in multiple regions, it should appear in each of the corresponding cells, along with the other genes of those regions. Can this be done in R? Please help.
Sample input.
RegionInfoFile
Region Chr Start End
1 1A 1 12345
2 1A 23456 34567
3 2A 1234 23456
***
1830 7D 123 45678
GeneInfoFile
Gene Chr Start End
GeneID1 1A 831 1437
GeneID2 1A 1487 2436
GeneID3 1B 2665 5455
***
GeneID10101 7D 13456 56789
RequiredOutputFile
Region Chr Start End Gene
1 1A 1 12345 GeneID1, GeneID2, GeneID5, GeneID6 ***
***
1830 7D 123 45678 GeneID7, GeneID100, GeneID200 ***
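
A minimal base R sketch of one way the required output could be built. It assumes that a gene counts as "in" a region when the two intervals overlap at all on the same chromosome; if a gene must lie entirely inside the region, the condition needs to be changed as noted in the comment. The toy data frames, the extra GeneID4 row, and the helper name genes_in_region are made up for illustration; the real tables would be read from the two input files instead (for example with read.table(..., header = TRUE)).

```r
# Toy data modelled on the sample above (values are illustrative only).
regions <- data.frame(
  Region = c(1, 2, 3),
  Chr    = c("1A", "1A", "2A"),
  Start  = c(1, 23456, 1234),
  End    = c(12345, 34567, 23456),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  Gene  = c("GeneID1", "GeneID2", "GeneID3", "GeneID4"),
  Chr   = c("1A", "1A", "1B", "1A"),
  Start = c(831, 1487, 2665, 12000),
  End   = c(1437, 2436, 5455, 24000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the comma-separated IDs of all genes that
# overlap one region. ASSUMPTION: any overlap on the same chromosome counts.
# For strict containment use instead:
#   genes$Start >= start & genes$End <= end
genes_in_region <- function(chr, start, end, genes) {
  hit <- genes$Chr == chr & genes$Start <= end & genes$End >= start
  paste(genes$Gene[hit], collapse = ", ")
}

# Apply the helper to every region; a gene that spans several regions
# is listed in each of them.
regions$Gene <- mapply(
  genes_in_region,
  regions$Chr, regions$Start, regions$End,
  MoreArgs = list(genes = genes),
  USE.NAMES = FALSE
)

print(regions)
```

In this toy example GeneID4 spans regions 1 and 2, so it is listed in both cells, and region 3 gets an empty cell because no gene lies on chromosome 2A. The result can be written to a tab-separated file with write.table(regions, sep = "\t", quote = FALSE, row.names = FALSE) plus a file name of your choice.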