5

I would like to test a bunch of genomic locations of the form:

chr4:154723876-154724615
chr6:139580853-139581090
chr18:30440532-30441569

I want to see whether they are located in an UTR or intron or exon or an intergenic sequence. I don't care for information about in which genes' introns (etc.) these coordinates are.

I assume that each known genetic element (like an exon) has defined genomic location (start-end position in the genome on each chromosome). I know this is true for exons and introns, as for example Ensembl has IDs for each exon in the genome: see example of exons and introns of Amy1 gene in Mus musclulus. I want to query a database of such locations with the above list of my locations, and if there is an overlap between the two (ideally I should be able to specify the overlap, say, at least 10bp, but if not I am OK), I should get a hit (yes, this region is in the exon/intron/)

And the handicap is that I have a few thousand of these locations and would ideally like to query them in all one go and as an output have a table where each location would be assigned "intron/exon/utr/intergenic". The organism is Mus musculus and the locations are from across the genome.

I cannot for now provide a code sample of what I am trying to do because I don't know where to start - if I had a package or anything to build upon it would help me find the solution.

Would be perfect if I could do it in R, but AFAIK I can't do it in biomaRt and I couldn't find a package to do it. I thought of Galaxy, but given their non-trivial way of doing it and strange output they produce I would rather stick to R. The devil you know etc.

Help would be much appreciated.

yotiao
  • 273
  • 2
  • 12
  • Hi. Please post a sample of your data, along with examples of what it means to be "located in a UTR" etc. In addition, even if you don't have `R` code, post the algorithm you're trying to code up so we can see you've tried something. – Carl Witthoft Nov 20 '13 at 12:28
  • Hi, thanks. I edited a lot to make it clearer. Hope it's better now. – yotiao Nov 20 '13 at 13:41
  • Carl, this is not about an algorithm, you can’t decide that kind of thing with an algorithm!! This is about a package that would do queries in an appropriate database. I am interested by an answer as well. – Elvis Nov 20 '13 at 13:44
  • You hould ass 'bionformatics' and maybe 'bioconductor' in the tags. 'mapping' doesn’t seem appropriate. – Elvis Nov 20 '13 at 13:46
  • Done. The other tags were put there by people who moved me from stats.stackexchange :-) – yotiao Nov 20 '13 at 13:47
  • 1
    More appropriate on [biostars.org](http://biostars.org/) – Arun Nov 20 '13 at 13:48
  • @Elvis Um, yes, this **is** about an algorithm. You need to write an algorithm that does the pattern matching and looks at the proper elements of the data file. So if you want to know how to query the database, we need to understand the structure of said database and what constitutes a match (by number or character, not by some intellectual interpretatation) – Carl Witthoft Nov 20 '13 at 13:57
  • A position on a genome is given, you want to do a query a data base to obtain what is located at this position (an intron, an exon, an intergenic region, a promoting region, etc). That’s all. That’s hardly an algorithm, and in any case there’s nothing more to write. The question could be reworded like ``do you know a database which can be used for this and an R package which does the queries''? – Elvis Nov 20 '13 at 14:11
  • @yotiao I suggest to use [bedtools](https://bedtools.readthedocs.org/en/latest/) intersect option. Right tool for the job. Agree with Arun, [biostars.org](http://www.biostars.org/) is a better place for this question. – zx8754 Nov 20 '13 at 16:02

4 Answers4

1

OK, sorry it took me so long, but the paper is submitted and the way I did it finally was to:

1) Download the list of genomic coordinates for whole genes, exons, introns and so-called 3'-UTR exons and 5'-UTR exons from UCSC table browser using Ensembl gene annotation. The only finicky bit is that you have to download a file for whole genes and the rest separately, and the manual does not explicitly state what "whole gene" is. But if you paste the coordinates it produces into Genome Browser you could see it is 5' UTR, all introns and axons and 3' UTR.

2) Use BEDtools package (Quinlan and Hall 2010, https://www.ncbi.nlm.nih.gov/pubmed/20110278), a very nice manual with simple examples is here: http://bedtools.readthedocs.org/en/latest/ and used the intersect command with -f flag that let me set a minimum overlap (in bp or in %) between my coordinates and the UCSC one.

It worked like a charm - I got a tabulated file with overlaps of each feature. Hope this helps.

yotiao
  • 273
  • 2
  • 12
0

NCBI has a chromosome map viewer

http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?TAXID=9606&CHR=4&MAPS=ideogr,morbid[11164.00%3A11170.00]&QSTR=EVC%20OR%20HD%20OR%20FGFR3%20OR%20SNCA%20OR%20NRCLP%20OR%20FOP&QUERY=uid(1968,2105,2886,6280,13348,20241,9026199,9026201,9026283,9026440,9027752,9027884)&zoom=100

on the left you have two search boxes that says region show.

Tamalero
  • 471
  • 1
  • 7
  • 14
  • Well, yes, Genome Browser is even better and Ensembl viewer is cool too. But it doesn't help me at all if I have 1000 of locations to test. I could do it manually one by one for a bunch of them, but not all... – yotiao Nov 20 '13 at 14:22
0

This is not a complete answer, but I hope this helps.

The bioconductor package BSgenome.Mmusculus.UCSC.mm10 contains the (last assembly of) the mouse sequence. These two lectures (1 and 2) show how to use this kind of package. It seems that you can retrieve exons and introns with the bioconductor package GenomicFeatures which interacts with the UCSC database.

I think this is a good direction to look at. I don’t have time to find more information, please share what you will find.

Elvis
  • 548
  • 2
  • 14
0

as always there are several ways of doing things, but one really fast is using HOMER annotatePeaks.pl script. You just need to download the HOMER scripts and call: annotatePeaks.pl.

your_bed_file genome > your_output_file.

You can find the documentation in link Just be careful and format your bed file (what you called "genomic locations" file) following the columns structure that HOMER recognize. In the output file, you will get a column called "annotation" and another called "detailed annotation" where you will find the information of in which genomic region you find that coordinates.

(intron, exons, 5'UTR,3'UTR, intergenic, non-coding, GC islands...)

Is the faster way but some of the genes match that you can get using Bedtools maybe are not found by HOMER because the annotation does not exist. This has happened to me, in this case, what I did was, using R biomaRt, download the info of these genes that were found by bedtools and without HOMER annotation, and annotate them with the TSS, 5'UTR,exons starts and ends and then as you have the exact location of your peak r binding position with a "for if" statement you can find the genomic region.

gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79