How to download embedded PDF files from webpage using RSelenium?

Question

EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:

library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()

Now, I need R to click the download button, but I could not manage to do so. I tried:

button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()

But I get the following error:

Selenium message:Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'

Erro:    Summary: NoSuchElement
 Detail: An element could not be located on the page using the given search parameters.
 class: org.openqa.selenium.NoSuchElementException
 Further Details: run errorDetails method

Can someone tell me what is wrong here? Thanks!

Original question:

I have several webpages from which I need to download embedded PDF files, and I am looking for a way to automate this with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398 . It is a page from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent of the US Securities and Exchange Commission, SEC) for downloading the Notes to Financial Statements (Notas Explicativas) of Brazilian companies.

I tried several options, but the website seems to be built in a way that makes it difficult to extract direct links. I tried the approach suggested in "Downloading all PDFs from URL", but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector. Similarly, when I tried the approach described at https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") also returns an empty vector.

I am not used to writing web scraping code in R, so I cannot figure out what is happening. I appreciate any help!

There are no pdf files linked on that page, so pkg-rvest seems unlikely to be the solution. I'm guessing you are not used to webscraping in any language. The source of that page is written in Javascript and the button that brings up the dialog window where you can choose pdf files is named "btnGeraRelatorioPDF". I suspect you will need to get a package that has more facilities for interacting with webpages than are provided by rvest. The Selenium package used to be the goto solution but I think that other browser navigation tools are also available. Look at the page source with your browser. — IRTFM, May 15 '21 at 21:21
Searching SO with "[r] javascript dialog" I found this possibly helpful answer: https://stackoverflow.com/questions/29759438/rselenium-popup/29759797#29759797 And here's another possibly useful answer from the same contributor, @JDHarrison: https://stackoverflow.com/questions/38156180/how-to-download-a-file-behind-a-semi-broken-javascript-asp-function-with-r/38294857#38294857 — IRTFM, May 15 '21 at 21:24
Following the question suggestions from @IRTFM, I am using RSelenium, but I cannot figure out how to select "_Notas Explicativas_" in the first dropdown menu, which generates the embedded PDF file I want to download. — Veronica Santana, May 16 '21 at 16:37
@RonakShah, after opening the page `library(RSelenium) driver <- rsDriver(browser = "firefox") remote_driver <- driver[["client"]] remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")` I want to select the option "_Notas Explicativas_" in the first dropdown menu, which generates the pdf file I want to download. — Veronica Santana, May 16 '21 at 16:38
I guess I didn't read your note carefully and my earlier nomination of the button name was incorrect. Looks to me that you need to first select that menu band (or strip? not sure of the correct term) itself. In Chrome doing "Inspect" on that menu strip highlights a section that starts out ` — IRTFM, May 16 '21 at 17:07

score 0 · Answer 1 · answered Sep 13 '21 at 21:22

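For anyone facing the same problem, here is the solution I used. The download button sits inside the PDF viewer, which is an iframe nested inside another iframe, so the driver has to switch into both frames before it can locate the button. The Firefox profile is also set to save PDFs directly instead of opening them in the built-in viewer: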

library(RSelenium)

# Set up a Firefox profile that downloads PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))

driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)

option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()

# Find the iframes in the top-level page
web.elem <- remote_driver$findElements(using = "css", "iframe") # all iframes in the top-level document
sapply(web.elem, function(x){x$getElementAttribute("id")}) # print their ids
remote_driver$switchToFrame(web.elem[[1]]) # switch into the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # iframes nested inside Formularios Filho
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # print their ids
# The PDF viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # switch into the nested iframe (PDF viewer)
Sys.sleep(3) # give the PDF viewer time to load (3 seconds)

# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # allow some time for the download to finish before closing
remote_driver$close() # close the browser window
driver[["server"]]$stop() # stop the Selenium server started by rsDriver()
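
The fixed Sys.sleep(3) calls above work only if the page and the download each finish within three seconds. A possible refinement is to poll instead of sleeping. The sketch below is not part of the original solution: wait_for() and wait_for_pdf() are hypothetical helper names, the download folder is an assumption (the profile above does not set one, so Firefox would use its default), and the check for "*.part" files relies on Firefox keeping a temporary file with that suffix while a download is in progress.

```r
# Hypothetical helper: call `finder()` repeatedly until it stops throwing an
# error, or give up after `timeout` seconds. `finder` is any zero-argument
# function, for example a wrapped findElement() call.
wait_for <- function(finder, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- tryCatch(suppressMessages(finder()), error = function(e) NULL)
    if (!is.null(result)) return(result)
    if (Sys.time() >= deadline) stop("Timed out after ", timeout, " seconds")
    Sys.sleep(interval)
  }
}

# Hypothetical helper: wait until `dir` contains at least one PDF and no
# partial download. Assumes Firefox's "*.part" temporary-file convention.
wait_for_pdf <- function(dir, timeout = 60, interval = 0.5) {
  wait_for(function() {
    pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                       ignore.case = TRUE)
    partial <- list.files(dir, pattern = "\\.part$")
    if (length(pdfs) == 0 || length(partial) > 0) stop("download not finished")
    pdfs
  }, timeout = timeout, interval = interval)
}

# Usage with the session from the code above (not run here):
# button <- wait_for(function() {
#   remote_driver$findElement(using = "id", value = "download")
# })
# button$clickElement()
# pdf_path <- wait_for_pdf("~/Downloads")  # assumed download folder
```

Polling returns as soon as the element or file appears and fails with an explicit timeout error otherwise, which tends to be more robust than a fixed pause when looping over many pages.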

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-ask). — Community, Sep 14 '21 at 01:11
