I've been trying to batch-download PDFs from a list of URLs. Unfortunately, each of these URLs is a viewer page that displays the PDF and has a download button on it, and I can't figure out how to get the files themselves.

When I did this for a different website, I used this code (now with some of the links I need):

urls <- c("https://dom-web.pbh.gov.br/visualizacao/edicao/2714",
          "https://dom-web.pbh.gov.br/visualizacao/edicao/2714",
          "https://dom-web.pbh.gov.br/visualizacao/edicao/2716",
          "https://dom-web.pbh.gov.br/visualizacao/edicao/2718",
          "https://dom-web.pbh.gov.br/visualizacao/edicao/2720",
          "https://dom-web.pbh.gov.br/visualizacao/edicao/2721")
names = c("DECRETO Nº 17.297.pdf",
          "DECRETO Nº 17.298.pdf",
          "DECRETO Nº 17.304.pdf",
          "DECRETO Nº 17.308.pdf",
          "DECRETO Nº 17.309.pdf",
          "DECRETO Nº 17.313.pdf")

for (i in seq_along(urls)) {
  download.file(urls[i], destfile = names[i], mode = 'wb')
}

On the other website, this downloaded proper PDFs to my working directory. On this one, the downloaded files are empty. I've tried the solutions from [https://stackoverflow.com/questions/36359355/r-download-pdf-embedded-in-a-webpage] and [https://stackoverflow.com/questions/42468831/how-to-set-up-rselenium-for-r], but neither has worked for me.
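
One way to confirm what is going wrong is to check whether each saved file is really a PDF. This is a minimal sketch that relies only on the fact that PDF files start with the bytes `%PDF-`; the helper name `is_pdf` is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE if the file starts with the PDF signature "%PDF-".
is_pdf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  # Read only the first five bytes; an empty file yields a zero-length vector.
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}
```

Running `vapply(names, is_pdf, logical(1))` after the loop would show which files are real PDFs. A file that fails the check is plausibly an empty response or the HTML of the viewer page, not the document itself.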

If anyone has a lightbulb moment and can help me out, that would be the bee's knees.

Larissa
  • In the page inspector, there is a link you can download directly from R, such as `https://api-dom.pbh.gov.br/api/v1/documentos/18215786809b00f6b45d6efac68fe5a8fe0430052573402a9c4555e84f6d9e58` – Chamkrai Jun 18 '22 at 17:42
  • I think the two parts should be `paste`d together, with a slash between, and used as `url`. With the `destfile=` you actually just define the name/path on your machine? – jay.sf Jun 18 '22 at 18:38
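
The idea in these comments can be sketched as follows, assuming each document is served at the API path from the first comment followed by its hash. The helper name `build_pdf_url` and the file name `documento.pdf` are hypothetical, and the thread does not say how to find the hash for the other editions without opening each page.

```r
# Base path taken from the link in the first comment.
api_base <- "https://api-dom.pbh.gov.br/api/v1/documentos"

# Hypothetical helper: join the base path and a document hash with a slash.
build_pdf_url <- function(doc_hash) {
  paste(api_base, doc_hash, sep = "/")
}

pdf_url <- build_pdf_url(
  "18215786809b00f6b45d6efac68fe5a8fe0430052573402a9c4555e84f6d9e58"
)

# destfile is only the local name/path; mode = "wb" writes the file as binary.
# download.file(pdf_url, destfile = "documento.pdf", mode = "wb")
```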

1 Answer

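Since you mentioned RSelenium, here is one solution: load each viewer page in a browser, read the `src` attribute of the iframe that embeds the PDF, and download from that link.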

library(tidyverse)
library(rvest)
library(RSelenium)
library(netstat)

rD <- rsDriver(browser = "firefox", port = free_port())
remDr <- rD[["client"]]

# Open a viewer page and return the src of the iframe that embeds the PDF
get_links <- function(pages) {
  remDr$navigate(pages)
  Sys.sleep(10)  # give the page time to render the iframe
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_element("#app > div > div > div.card.p-1.mb-3.bg-white.rounded > div.card-body > div > iframe") %>%
    html_attr("src")
}

df <- tibble(
  links = c(
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2714",
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2714",
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2716",
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2718",
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2720",
  "https://dom-web.pbh.gov.br/visualizacao/edicao/2721"
)) %>% 
  mutate(pdf_links = map(links, get_links)) %>% 
  unnest(pdf_links)

names = c("DECRETO Nº 17.297.pdf",
          "DECRETO Nº 17.298.pdf",
          "DECRETO Nº 17.304.pdf",
          "DECRETO Nº 17.308.pdf",
          "DECRETO Nº 17.309.pdf",
          "DECRETO Nº 17.313.pdf")

# mode = "wb" writes the PDFs as binary; the default text mode can corrupt them on Windows
for (i in seq_along(df$pdf_links)) {
  download.file(df$pdf_links[i], destfile = names[i], mode = "wb")
}

Chamkrai
  • Thank you so much! For some reason, the files came out corrupted, but I used the links in my original code and now it works perfectly. – Larissa Jun 18 '22 at 21:04
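
The corrupted files mentioned in the comment above are consistent with a text-mode download: `download.file()` defaults to `mode = "w"`, which can alter binary files on Windows. That cause is an assumption, not something confirmed in the thread. Below is a minimal sketch of a download step that always uses binary mode and records which links failed; `download_pdfs` is a hypothetical helper name.

```r
# Hypothetical helper: download each link to its matching file name in binary
# mode and return a logical vector marking which downloads produced a file.
download_pdfs <- function(links, destfiles) {
  stopifnot(length(links) == length(destfiles))
  ok <- logical(length(links))
  for (i in seq_along(links)) {
    ok[i] <- tryCatch({
      download.file(links[i], destfile = destfiles[i], mode = "wb", quiet = TRUE)
      # Count the download as successful only if a non-empty file exists.
      file.exists(destfiles[i]) && file.size(destfiles[i]) > 0
    }, error = function(e) FALSE)
  }
  ok
}
```

With the objects from the answer, this could be called as `ok <- download_pdfs(df$pdf_links, names)`, after which `names[!ok]` lists the files that still need attention.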