
I want to download all of the PubMed article abstracts. Does anyone know how I can do this easily?

I found the source of the data: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/af/12/

Is there any way to download all of these tar files?

Thanks in advance.

Soundarya Thiagarajan

2 Answers


There is a package called rentrez (https://ropensci.org/packages/). Check it out. You can retrieve abstracts by specific keywords, PMIDs, etc. I hope it helps.

UPDATE: You can download all the abstracts by passing your list of IDs to the following code.

    library(rentrez)
    library(XML)

    your.ids <- c("26386083","26273372","26066373","25837167","25466451","25013473")
    # rentrez function to fetch the records from the pubmed db
    fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                                 rettype = "xml", parsed = TRUE)
    # Extract the abstracts for the respective IDs
    abstracts <- xpathApply(fetch.pubmed, '//PubmedArticle//Article', function(x)
                              xmlValue(xmlChildren(x)$Abstract))
    # Name the abstracts with their IDs
    names(abstracts) <- your.ids
    abstracts
    # Collect into a data frame and write to CSV
    col.abstracts <- do.call(rbind.data.frame, abstracts)
    dim(col.abstracts)
    write.csv(col.abstracts, file = "test.csv")
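
The code above uses a fixed vector of PMIDs; if you want to go from keywords to IDs first (as mentioned at the top), a minimal sketch using rentrez's entrez_search() looks like this. The search term "breast cancer[TIAB]" is only a placeholder for whatever query you need:

    # Sketch: turn a keyword search into a vector of PMIDs (placeholder term)
    library(rentrez)
    res <- entrez_search(db = "pubmed", term = "breast cancer[TIAB]", retmax = 20)
    res$count            # total number of matching records
    your.ids <- res$ids  # feed these into entrez_fetch() as above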
user5249203
  • I am getting error in xpathApply : Error: could not find function "xpathApply" – Soundarya Thiagarajan Nov 06 '15 at 06:24
  • I imported XML library and it worked. Thanks a lot. I am just figuring out to write the data now. > write(trnL, "Test/trnL.fasta") Error in file(file, ifelse(append, "a", "w")) : cannot open the connection can you help me out in this.. how do I get the abstract data after this.. – Soundarya Thiagarajan Nov 06 '15 at 08:23
  • sorry, it is hard for me to understand from a single line. Can you post a new SO Q, with what you did and what resulted in error. Thanks. – user5249203 Nov 06 '15 at 15:45
  • I am getting the ids using your.ids from your statement of code, i want the abstract data from it, I am just figuring out how do I get the data. Thanks :) – Soundarya Thiagarajan Nov 06 '15 at 15:46
  • Please tick the answer to (green), if it solves your problem. Thank you. – user5249203 Jan 20 '16 at 16:21

I appreciate that this is a somewhat old question.

If you wish to get all the PubMed entries with Python, I wrote the following script a while ago:

import requests

# Search all of PubMed (between two dates that span everything) and keep the
# results on the history server (usehistory=y) so efetch can page through them.
search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=1800/01/01&maxdate=2016/12/31&usehistory=y&retmode=json"
search_r = requests.post(search_url)
search_data = search_r.json()
webenv = search_data["esearchresult"]['webenv']
total_records = int(search_data["esearchresult"]['count'])
fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmax=10000&query_key=1&webenv="+webenv

# Fetch the records in batches of 10,000, writing each batch to its own file.
for i in range(0, total_records, 10000):
    this_fetch = fetch_url+"&retstart="+str(i)
    print("Getting this URL: "+this_fetch)
    fetch_r = requests.post(this_fetch)
    f = open('pubmed_batch_'+str(i)+'_to_'+str(i+9999)+".json", 'w')
    f.write(fetch_r.text)
    f.close()

print("Number of records found: "+str(total_records))

It starts off by making an entrez/eutils search request between two dates chosen to capture all of PubMed. From that response, the 'webenv' (which saves the search history) and total_records are retrieved. Using the webenv capability saves having to hand individual record IDs to the efetch call.

Fetching records (efetch) can only be done in batches of 10,000, so the for loop grabs 10,000 records at a time and saves them to labelled files until all the records have been retrieved.

Note that requests can fail (non-200 HTTP responses, errors); in a more robust solution you should wrap each requests.post() in a try/except, and before writing the data to file you should check that the HTTP response has a 200 status.
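
For example, a rough sketch of that error handling (not part of the original script; the helper name fetch_batch is just illustrative) might look like:

    import requests

    # Sketch: fetch one batch with basic error handling; retries are left out.
    def fetch_batch(url, out_path):
        try:
            r = requests.post(url, timeout=120)
        except requests.RequestException as e:
            print("Request failed: " + str(e))
            return False
        if r.status_code != 200:
            print("Unexpected HTTP status: " + str(r.status_code))
            return False
        with open(out_path, "w") as f:
            f.write(r.text)
        return True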

DanB
  • Thank you so much DanB ! Same thing I tried in Python long back - it didnt work - this would be really useful for me. thank you. – Soundarya Thiagarajan Aug 10 '16 at 14:06
  • Note that this will take about 24 hours of runtime to execute on a single core. Consider using Pool and map instead of the for loop to run multiple concurrent efetch requests from pubmed. The number of concurrent http requests will be dependent on the bandwidth you have available. – DanB Aug 11 '16 at 14:53
  • This solution does not work, since it's not allowed to set retstart higher than 9998. The following message gets otherwise returned: ``` Search backend cannot retrieve history data. Reason: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/ ``` – Oliver Küchler Feb 25 '23 at 15:31