
I've inherited a dataset of RNA-seq output from Canis lupus familiaris (dog). The gene identifiers are in Ensembl format and look like this: ENSCAFT00000001452.3. I am trying to use biomaRt to convert them to a more common ID and need help. I am very new to R, so any help getting started would be appreciated.

Can these Ensembl IDs be converted to any other Ensembl ID (e.g. for a different species)? Can they be converted to RefSeq or GI accession numbers? How?

I started with this:

library('biomaRt')

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))

genes <- df$genes

I am lost after this. Thanks for any help. Ryan

Ryan T.
  • Here is a good place to start: https://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html – GordonShumway Aug 29 '18 at 23:42

1 Answer

Here is a step-by-step example:

  1. Load the biomaRt library.

    library(biomaRt)
    
  2. As query input we have Canis lupus familiaris Ensembl transcript IDs (note that they are not Ensembl gene IDs). We also need to strip the trailing dot and digit(s), which give the version of the annotation and are not part of the stable ID that the query expects.

    tx <- c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3")
    tx <- gsub("\\.\\d+$", "", tx)
    
  3. We now query the database for the Ensembl transcript IDs in tx:

    ensembl <- useEnsembl(biomart = "ensembl", dataset = "cfamiliaris_gene_ensembl")
    res <- getBM(
        attributes = c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name", "description"),
        filters = "ensembl_transcript_id",
        values = tx,
        mart = ensembl)
    res
    #     ensembl_gene_id ensembl_transcript_id external_gene_name
    #1 ENSCAFG00000000934    ENSCAFT00000001452            COL14A1
    #2 ENSCAFG00000001086    ENSCAFT00000001656                MYC
    #                                                                   description
    #1               collagen type XIV alpha 1 chain [Source:VGNC Symbol;Acc:VGNC:51768]
    #2 MYC proto-oncogene, bHLH transcription factor [Source:VGNC Symbol;Acc:VGNC:43527]
    

Note that you can get a data.frame of all attributes for a particular mart with listAttributes(ensembl).
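
As a rough sketch of how that attribute table can be searched, the hypothetical helper `find_attributes()` below (not part of biomaRt) filters it by a pattern. It assumes only that the table has `name` and `description` columns, as the data.frame returned by `listAttributes()` does. The demo table is made up for illustration; whether RefSeq or homolog attributes are actually offered for the dog dataset should be checked against the real output.

```r
# Hypothetical helper (not part of biomaRt): keep the rows of an attribute
# table whose name or description matches a pattern, ignoring case.
find_attributes <- function(attr_table, pattern) {
  hits <- grepl(pattern, attr_table$name, ignore.case = TRUE) |
    grepl(pattern, attr_table$description, ignore.case = TRUE)
  attr_table[hits, , drop = FALSE]
}

# Made-up stand-in for listAttributes(ensembl), for illustration only.
demo_attributes <- data.frame(
  name = c("ensembl_gene_id", "external_gene_name", "refseq_mrna"),
  description = c("Gene stable ID", "Gene name", "RefSeq mRNA ID"),
  stringsAsFactors = FALSE
)

find_attributes(demo_attributes, "refseq")

# With a live connection (requires network access):
# find_attributes(listAttributes(ensembl), "refseq")
# find_attributes(listAttributes(ensembl), "homolog")
```

Any attribute names found this way can then be passed to the `attributes` argument of `getBM()`.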

In addition to the link @GordonShumway gives in the comment above, another good (and succinct) introduction to biomaRt can be found on the Ensembl website.
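
If the transcript IDs sit in a column of a CSV file, the same steps apply to that column. The following is a minimal sketch only: the file name `transcripts.csv`, the column name `transcript_id`, and the helper `clean_ensembl_ids()` are all assumptions made for illustration, and a small in-memory data.frame stands in for the file so the example runs without it.

```r
# Hypothetical helper: drop the trailing ".<digits>" version suffix, then
# remove duplicates, empty strings and missing values.
clean_ensembl_ids <- function(ids) {
  ids <- sub("\\.\\d+$", "", as.character(ids))
  unique(ids[!is.na(ids) & nzchar(ids)])
}

# Stand-in for: df <- read.csv("transcripts.csv", stringsAsFactors = FALSE)
# (file name and column name are assumptions).
df <- data.frame(
  transcript_id = c("ENSCAFT00000001452.3", "ENSCAFT00000001656.3",
                    "ENSCAFT00000001452.3"),
  stringsAsFactors = FALSE
)

tx <- clean_ensembl_ids(df$transcript_id)
tx

# Query and merge back onto the original rows (requires network access
# and the `ensembl` mart object created in step 3):
# res <- getBM(
#     attributes = c("ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"),
#     filters = "ensembl_transcript_id", values = tx, mart = ensembl)
# df$ensembl_transcript_id <- sub("\\.\\d+$", "", df$transcript_id)
# annotated <- merge(df, res, by = "ensembl_transcript_id", all.x = TRUE)
```

Deduplicating before the query keeps the request small, and the commented `merge(..., all.x = TRUE)` keeps every original row even when a transcript ID returns no match.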

Maurits Evers
  • Thank you! I am stuck trying to figure out how to run this on a .csv file in which one column holds the Canis lupus familiaris Ensembl transcript IDs and there are 275 rows. Thanks – Ryan T. Aug 30 '18 at 01:36
  • @RyanT. Just extract the column with the transcript IDs as a `character` vector `tx`, remove the trailing dot+digit, and use `getBM(..., values = tx, ...)` just as in my post. `getBM` is already vectorised for a vector of query IDs. – Maurits Evers Aug 30 '18 at 01:38
  • @RyanT. PS. I have edited my post to show results for two transcripts to illustrate. – Maurits Evers Aug 30 '18 at 01:42
  • Thank you very much. I have it figured out! – Ryan T. Aug 30 '18 at 15:42