I have an expression set matrix whose rownames appear to be GENCODE IDs, for example "ENSG00000000003.14", "ENSG00000000457.13", "ENSG00000000005.5" and so on. I would like to convert these to gene symbols, but I am not sure of the best way to do so, especially because of the ".14" or ".13" suffix, which I believe is the version. Should I first trim everything after the dot from each ID and then use biomaRt to convert? If so, what is the most efficient way of doing it? Is there a better way to get to the gene symbol?

Many thanks for your help

Manish Goel
r_mvl
  • Maybe this is what you are looking for: https://stackoverflow.com/questions/28543517/how-can-i-convert-ensembl-id-to-gene-symbol-in-r. You will probably get more help if you post it here: https://www.biostars.org/ – S Rivero Jul 03 '17 at 20:41
  • Those are Ensembl IDs, and biomaRt is probably your best option. Here is a previous question; you just need to change `attributes` accordingly. The biomaRt tutorial is very helpful. – emilliman5 Jul 03 '17 at 21:04
  • Thanks for the replies. I tried biomaRt, but it doesn't recognise these as Ensembl gene IDs because of the "dot number" at the end of each ID (e.g. the ".14" in "ENSG00000000003.14"). – r_mvl Jul 03 '17 at 21:07

2 Answers

As already mentioned, these are Ensembl IDs. The first thing to do is check your expression set object and identify which database it uses for annotation, because the same ID may map to a different gene symbol in a newer annotation release.

Assuming the IDs are human, you can use this code to get the gene symbols:

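library(org.Hs.eg.db)       ## Annotation DB
library(AnnotationDbi)

## Ensembl gene IDs without the version suffix
ids <- c("ENSG00000000003", "ENSG00000000457", "ENSG00000000005")
## AnnotationDbi:: avoids a clash with dplyr::select if dplyr is loaded
gene_symbol <- AnnotationDbi::select(org.Hs.eg.db, keys = ids, columns = "SYMBOL", keytype = "ENSEMBL")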

You can try with org.Hs.eg.db or the exact db your expression set uses (if that information is available).
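
Since the IDs in the question carry a version suffix, they would need trimming before being used as `keys`. A minimal base R sketch of that step, assuming every ID has the form "ID.version" (the vector name `versioned_ids` is only illustrative):

```r
# Example IDs copied from the question
versioned_ids <- c("ENSG00000000003.14", "ENSG00000000457.13", "ENSG00000000005.5")

# Remove the first dot and everything after it
ids <- sub("\\..*$", "", versioned_ids)
print(ids)

# Trimming can in principle produce repeated IDs, so it is worth checking
print(any(duplicated(ids)))
```

The resulting `ids` vector could then be passed as `keys` in the `select()` call.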

Manish Goel

Thanks for the help. My problem was getting rid of the version suffix (.XX) at the end of each Ensembl gene ID. I thought there would be a more straightforward way of going from a versioned Ensembl gene ID (GENCODE basic annotation) to a gene symbol. In the end I did the following, and it seems to work:

# Strip the version suffix: the first dot and everything after it
df$ensembl_gene_id <- gsub('\\..+$', '', df$ensembl_gene_id)

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$ensembl_gene_id
# Look up the HGNC symbol for each Ensembl gene ID
symbol <- getBM(filters = "ensembl_gene_id",
                attributes = c("ensembl_gene_id", "hgnc_symbol"),
                values = genes,
                mart = mart)
# Inner join: rows of df without a match in symbol are dropped
df <- merge(x = symbol,
            y = df,
            by = "ensembl_gene_id")
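
One thing to be aware of: `merge()` performs an inner join by default, so any gene that BioMart returns no row for disappears from `df`. If those rows should be kept, `all.y = TRUE` is one option. A small sketch with toy data frames (`toy_df`, `toy_symbol` and the ID "ENSG99999999999" are hypothetical stand-ins, not real query results):

```r
# Toy stand-in for the expression data; the last ID is deliberately fake
toy_df <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG99999999999"),
                     count = c(10, 0, 5),
                     stringsAsFactors = FALSE)

# Toy stand-in for the getBM() result; it has no row for the fake ID
toy_symbol <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
                         hgnc_symbol = c("TSPAN6", "TNMD"),
                         stringsAsFactors = FALSE)

# all.y = TRUE keeps every row of toy_df; unmatched genes get NA as symbol
annotated <- merge(x = toy_symbol, y = toy_df, by = "ensembl_gene_id", all.y = TRUE)
print(annotated)
```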
r_mvl