5

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)

So far, the best solution I have found is using the biomaRt package (bioconductor). There is an example of the kind of lookup I need to do here. Fitted for my purposes, here is my code:

library(biomaRt)

#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")

#look up a single gene and get SNP data
getBM(attributes = c(
  "ensembl_gene_stable_id",
  'refsnp_id',
  'chr_name',
  'chrom_start',
  'chrom_end',
  'minor_allele',
  'minor_allele_freq'),
  filters = 'ensembl_gene',
  values ="ENSG00000166813",
  mart = variation
)

This outputs a data frame that begins like this:

  ensembl_gene_stable_id  refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1        ENSG00000166813  rs8179065       15    89652777  89652777            T          0.242412
2        ENSG00000166813  rs8179066       15    89652736  89652736            C          0.139776
3        ENSG00000166813 rs12899599       15    89629243  89629243            A          0.121006
4        ENSG00000166813 rs12899845       15    89621954  89621954            C          0.421126
5        ENSG00000166813 rs12900185       15    89631884  89631884            A          0.449681
6        ENSG00000166813 rs12900805       15    89631593  89631593            T          0.439297

(4612 rows)

The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculated on the fly. But looking up the bare minimum of only the SNPs rs ids takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 mbit).

I tried looking up more genes per query. This did dot help. Looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.

What is the best way to get a vector of SNPs that associated with a vector of gene ids?

If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.

Community
  • 1
  • 1
CoderGuy123
  • 6,219
  • 5
  • 59
  • 89
  • 2
    Since your query is so simple the long runtime is a Biomart database restriction and it’s unlikely that there’s a faster way. That said, this *is* very slow, and should usually definitely run faster. On my computer it runs in less than 5 seconds. But the Biomart server is famously capricious. – Konrad Rudolph Jan 05 '17 at 17:11
  • Ideally, I would just download two tables in CSV format somewhere and do my own filtering. But I can't find any such tables. There apparently used to be some, according to [this 5 year old question](https://www.biostars.org/p/12323/). – CoderGuy123 Jan 05 '17 at 18:08
  • You can still download the dbSNP bulk data, for instance here: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b146_GRCh38p2/ – Konrad Rudolph Jan 05 '17 at 18:31
  • Yes, but the formats are terrible. I spent multiple hours trying to open these files without luck. Why can't they just make available the data in a normal format like csv? – CoderGuy123 Jan 05 '17 at 22:09
  • Using "5 year old answer" download both SNPs and Genes, they are zipped flat text files. Then do the [merge with overlap](http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window). – zx8754 Jan 05 '17 at 22:15
  • 1
    @Deleet It’s the industry standard format. There are tons of parsers for it, including ones for R/Bioconductor, e.g. ‹VariantAnnotation›. In fact, I believe that Bioconductor probably also has pre-made dbSNP databases, though you’d have to investigate that yourself. – Konrad Rudolph Jan 05 '17 at 22:34

1 Answers1

6

Since you're using R, here's an idea that uses the package rentrez. It utilizes NCBI's Entrez database system and in particular the eutils function, elink. You'll have to write some code around this and probably tweak parameters, but could be a good start.

library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids 
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')

# access results with $links
length(snp_links$links$gene_snp)
5779

head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341' 

I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...

For multiple gene ids:

multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1.    '864622690' '864622594' '864622518' '864622451' '864622387' '864622341' 
2.    '797045093' '797044466' '797044465' '797044464' '797044463' '797016353' 

The results are grouped by gene with by_id=TRUE

Kevin
  • 7,960
  • 5
  • 36
  • 57