-1

I have several 100 genomic regions with start end positions. I want to extract gene IDs in each region from the reference file (reference file contains thousands of genes with their start end position). I extracted region information of genes, for example, output file tells me if a gene is present in region one or two and so on. However, each region contains many genes and there are several genes that lie in multiple regions. I want an output file that would put the IDs of all the genes in a cell next to that region (each gene ID separated by comma). If a gene is present in multiple regions, it would appear in multiple cells corresponding to those regions along with other genes of that region. Can this be done with an R code? Please help.

Sample input.

RegionInfoFile

Region  Chr Start   End
1   1A  1   12345
2   1A  23456   34567
3   2A  1234    23456
***
1830 7D 123     45678

GeneInfoFile

Gene    Chr Start   End
GeneID1 1A  831 1437
GeneID2 1A  1487    2436
GeneID3 1B  2665    5455
***
GeneID10101 7D  13456  56789 

RequiredOutPutFile

Region  Chr Start   End   Gene
1       1A  1       12345 GeneID1, GeneID2, GeneID5, GeneID6 ***
***
1830    7D  123     45689 GeneID7, GeneID100, GeneID200 ***
Community
  • 1
  • 1
Luqman
  • 29
  • 7
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 04 '19 at 16:04
  • @MrFlick: I have edited the question. Please have another look. – Luqman Nov 06 '19 at 01:42

1 Answers1

1

What you are looking for is typically accomplished via the excellent GenomicRanges package on Bioconductor. You can create GenomicRanges objects from regions and find where they overlap using the findOverlaps() function. To control what counts as an overlap see the "type" argument in findOverlaps(..., type=""). Here's an example of that in R. I'm sure there are more elegant ways of doing this but that's what I cam up with during lunch:

# install.packages('readr')
library(readr)
# install.packages('BiocManager')
# BiocManager::install('GenomicRanges')
library(GenomicRanges)

# Read data into R. Needs to have column names "chr", "start" and "end"
x <- read_csv('~/Downloads/regions.csv')
y <- read_csv('~/Downloads/genes.csv')

# Set up two GenomicRanges objects
gr_x <- makeGRangesFromDataFrame(x, keep.extra.columns = TRUE)
gr_y <- makeGRangesFromDataFrame(y, keep.extra.columns = TRUE)

# Overlap the regions containing genes with the trait data
ovl <- findOverlaps(gr_y, gr_x, type = "any", select = "all", ignore.strand = TRUE)
# Group hits by trait regions
hits_by_region <- split(from(ovl), to(ovl))

# Create a data.frame that matches a trait region index to a string of genes
hits_df <- do.call(rbind, lapply(seq_along(hits_by_region), function(f) {
  idx <- as.integer(names(hits_by_region[f]))
  genes <- gr_y$Gene[hits_by_region[[f]]]
  data.frame(Index = idx, Genes = paste(genes, collapse=','), stringsAsFactors = FALSE)
}))

# Add the string of genes as metadata to the GRanges object
gr_x$Genes <- ""
gr_x$Genes[hits_df$Index] <- hits_df$Genes

# Convert back to data.frame (if needed)
genes_by_regio_df <- as.data.frame(gr_x)