I have a list of titles of academic papers that I need to download. I would like to write a loop that downloads their PDF files from the web, but I can't find a way to do it.
Here is a step-by-step outline of what I have so far (answers in either R or Python are welcome):
# Create list with paper titles (example with 4 papers from different journals)
titles <- c("Effect of interfacial properties on polymer–nanocrystal thermoelectric transport",
"Reducing social and environmental impacts of urban freight transport: A review of some major cities",
"Using Lorenz curves to assess public transport equity",
"Green infrastructure: The effects of urban rail transit on air quality")
# Loop step 1 - Query the paper title in Google Scholar to get the URL of the journal webpage that hosts the paper
# Loop step 2 - Download the PDF from the journal webpage and save it on your computer
for (i in titles) {
  journal_URL <- query_google_scholar(i)  # placeholder: this function does not exist; finding it is my question
  download.file(url = journal_URL,
                destfile = paste0(i, ".pdf"),
                mode = "wb")  # "wb" so the binary PDF is not corrupted on Windows
}
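To make the idea more concrete, here is a rough sketch of what I imagine, using the Crossref API (api.crossref.org) as a stand-in for the Google Scholar query, since Scholar has no official API as far as I know. The response fields (message$items, URL) are what I gathered from the Crossref docs, so treat this as an assumption rather than tested code:

library(httr)  # assumes httr (and jsonlite, for JSON parsing) are installed

for (i in titles) {
  # Step 1 stand-in: ask Crossref for the best title match and take its URL
  resp <- GET("https://api.crossref.org/works",
              query = list(query.title = i, rows = 1))
  stop_for_status(resp)
  hit <- content(resp, as = "parsed")$message$items[[1]]
  journal_URL <- hit$URL  # usually a DOI link to the landing page, not the PDF

  # Step 2: attempt the download; sanitize the title so it is a legal file name
  fname <- paste0(gsub("[^A-Za-z0-9 _-]", "_", i), ".pdf")
  download.file(url = journal_URL, destfile = fname, mode = "wb")
}

The obvious problem is that hit$URL normally resolves to the article's landing page rather than the PDF itself, and that last hop is exactly the part I can't figure out how to do in general.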
Complications:
Loop step 1 - The first hit on Google Scholar should be the paper's original URL. However, I've heard Google Scholar is a bit fussy about bots, so the alternative would be to query plain Google and take the first URL (hoping it points to the correct page).
Loop step 2 - Some papers are gated, so I imagine it would be necessary to include authentication info (user = __, passwd = __). If I am on my university network, though, this authentication should happen automatically, right?
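For the gated case, I imagine something along these lines with httr, where pdf_URL, my_user, and my_passwd are placeholders (on the university network the proxy presumably handles authentication transparently, so this step may not even be needed):

library(httr)

pdf_URL <- "https://journal.example.com/article/123.pdf"   # placeholder direct PDF link
resp <- GET(pdf_URL, authenticate("my_user", "my_passwd"))  # placeholder credentials
stop_for_status(resp)                                       # fail loudly on 401/403 etc.
writeBin(content(resp, as = "raw"), "paper.pdf")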
P.S. I only need to download the PDFs; I'm not interested in bibliometric information (e.g. citation records, h-index). For bibliometric data there is already some guidance here (R users) and here (Python users).