
I am trying to scrape this website.

[Screenshot: the ECB press conference index page, showing a main link and a list of clickable titles]

As you can see, there is one main link and a series of titles that you can click to access the text. What I would like to get in the end is the text behind all these sublinks of the main link. I am not very familiar with web scraping, so after looking around I thought something like this would work:

library(rvest)

x <- read_html("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

x1 <- html_nodes(x, ".doc-title a") # selector found with SelectorGadget

This attempt, however, fails completely. Can anyone help me with this?

  • The results are loaded in after the page has loaded, so such a scraping attempt will unfortunately not succeed. The links that are loaded in follow a nice pattern, though, and you can easily collect them by year. For example, 2020: https://www.ecb.europa.eu/press/pressconf/2020/html/index_include.en.html – Bas Jul 21 '20 at 14:45
  • @Bas This sounds good. I can replace "2020" and loop it through, right? – Rollo99 Jul 21 '20 at 14:50
  • Yes, you can. I found this out by right-clicking on the web page, clicking 'inspect element', going to the 'network' tab, and filtering by `XHR` (data) requests. As you refresh the page and scroll down, you can see the requests being made. – Bas Jul 21 '20 at 19:08
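
Following Bas's suggestion, a minimal rvest-only sketch (no Selenium needed) could loop over the per-year index_include pages. The year range here is illustrative, and it assumes those include pages are static HTML whose anchor tags carry the transcript links:

library(rvest)

# Illustrative year range -- adjust to the years you actually need
years <- 1999:2020

# Collect the hrefs from each per-year include page
all_links <- unlist(lapply(years, function(y) {
  url <- paste0("https://www.ecb.europa.eu/press/pressconf/", y,
                "/html/index_include.en.html")
  html_attr(html_nodes(read_html(url), "a"), "href")
}))

# Keep the English-language transcript pages only
all_links <- unique(all_links[grepl("\\.en\\.html$", all_links)])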

1 Answer


It is possible to get the text of the links on the initial page:

library(RSelenium)
library(rvest)
library(stringr) # needed for str_extract_all() and str_detect()

# Start a Selenium server in Docker (shell() is Windows-specific; use system() on Linux/macOS)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# Scroll down step by step so the whole page gets loaded
for(i in 1:100)
{
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}

Sys.sleep(5)
html_Content <- remDr$getPageSource()[[1]]

# Pull all press-conference links out of the raw HTML
html_Link <- str_extract_all(string = html_Content, pattern = "/press/pressconf/[^<]*html")[[1]]

# Keep the English-language pages and drop the index pages themselves
html_Link_En <- html_Link[str_detect(html_Link, "\\.en\\.html")]
links_To_Remove <- c("/press/pressconf/html/index.en.html", "/press/pressconf/visual-mps/html/index.en.html")
html_Link_En <- unique(html_Link_En[!(html_Link_En %in% links_To_Remove)])

# Extract text from the first link
# (a for loop over html_Link_En gets the text of all links, as sketched below)
html_Content <- read_html(paste0("https://www.ecb.europa.eu", html_Link_En[1]))
html_Content %>% html_text()
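
A sketch of that loop, with a hypothetical all_Text result vector and a small pause between requests:

# Fetch every English link and store the text of each page
all_Text <- character(length(html_Link_En))
for(i in seq_along(html_Link_En))
{
  page <- read_html(paste0("https://www.ecb.europa.eu", html_Link_En[i]))
  all_Text[i] <- html_text(page)
  Sys.sleep(1) # be polite to the server
}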
– Emmanuel Hamel