
I am web scraping a website to collect data for research purposes, using RSelenium, Docker and rvest.

I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that the results change when I run this script: the number of observations of the variable I'm interested in changes. It concerns about 50,000 observations, and when I run the script several times, the total number of observations differs by a few hundred.

I'm thinking it has something to do with the internet connection being too slow or with the website not loading quickly enough... or something. When I change Sys.sleep(2) the results change too, but without a clear pattern of whether higher values make things better or worse.

In the terminal I run:

docker run -d -p 4445:4444 selenium/standalone-chrome

Then my code looks something like this:

library(rvest) # for read_html(), html_nodes(), html_text() and the %>% pipe

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")

pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)
i <- 1
while (i <= pages) {
    variable[[i]] <- remDr$getPageSource()[[1]] %>%
        read_html(encoding = "UTF-8") %>%
        html_nodes("node that indicates the information I want") %>% # select the information I want
        html_text()
    element_next_page <- remDr$findElement(using = 'css selector', "node that indicates the 'next page' button") # select the button that goes to the next page
    element_next_page$sendKeysToElement(list(key = "enter")) # go to the next page
    Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
    i <- i + 1
}
variable <- unlist(variable)

Somehow, running this multiple times keeps returning different results in terms of the number of observations that remain after I unlist variable.
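One thing that may help diagnose this (not part of my original script, just a suggestion): before the final unlist(), record how many nodes each iteration returned, so you can see which pages came back short.

# Run this before variable <- unlist(variable):
counts <- lengths(variable)  # number of extracted observations per page
summary(counts)              # a page that was still loading typically shows 0 or very few
which(counts == 0)           # iterations where nothing was extracted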

Does anyone have similar experiences, or tips on what to do?

Thanks.

Thissen
  • Hi Thissen, perhaps add a check to see if the page/element is updated after clicking next? Often something in the UI changes that you can use as a validator. Sys.sleep() is often used to give the dynamic page time to render. – Arcoutte Dec 20 '19 at 14:56
  • Hi Arcoutte, thanks for your comment. I can use ```remDr$screenshot(display = TRUE)``` to see that the last page is reached. And indeed, occasionally I can see that my script got stuck loading a page. Perhaps the solution is giving the dynamic page more time to render with Sys.sleep()? – Thissen Dec 20 '19 at 15:28
  • It is, but you will never be certain. Perhaps the following can help: https://stackoverflow.com/questions/43402237/r-waiting-for-page-to-load-in-rselenium-with-phantomjs – Arcoutte Dec 20 '19 at 17:20
  • So, I would do this before selecting each node? My code would look somewhat like this? ``webElem <- NULL; while (is.null(webElem)) { webElem <- tryCatch({remDr$findElement(using = 'css selector', "node")}, error = function(e) {NULL}) }; element_next_page <- remDr$findElement(using = 'css selector', "node"); element_next_page$sendKeysToElement(list(key = "enter"))`` – Thissen Jan 06 '20 at 15:11
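For reference, the wait-and-retry pattern discussed in the comments above might look like the sketch below. The selectors are the same placeholders used in the question, and wait_for_element() is a hypothetical helper (not part of RSelenium) that polls for a CSS selector until it appears or a timeout is hit, instead of relying on a fixed Sys.sleep().

# Hypothetical helper: poll for a CSS selector until it appears or the timeout is reached
wait_for_element <- function(remDr, selector, timeout = 10) {
  end_time <- Sys.time() + timeout
  while (Sys.time() < end_time) {
    elem <- tryCatch(
      remDr$findElement(using = "css selector", selector),
      error = function(e) NULL
    )
    if (!is.null(elem)) return(elem)
    Sys.sleep(0.5)
  }
  stop("Timed out waiting for selector: ", selector)
}

# In the scraping loop, wait for the content node before reading the page source,
# and reuse the helper to locate the 'next page' button:
# wait_for_element(remDr, "node that indicates the information I want")
# element_next_page <- wait_for_element(remDr, "node that indicates the 'next page' button")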

1 Answer


You could consider including the following code before extracting the text:

for (i in 1:100) {
  print(i)
  remDr$executeScript(paste0("scroll(0, ", i * 2000, ")"))
}

This code forces the browser to scroll through almost the entire web page, which can help trigger the loading of sections that have not been rendered yet. This approach is used in the following post: How to webscrape texts that are contained into sublinks of a link in R?.
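If you combine this with the loop from the question, one option (a sketch under the same placeholder selectors, with the scroll counter renamed to j so it does not clobber the page counter i) is to wrap the scrolling in a small helper and call it right before getPageSource():

scroll_page <- function(remDr, steps = 100, step_px = 2000) {
  # Scroll down the page in increments so lazily loaded sections get rendered
  for (j in seq_len(steps)) {
    remDr$executeScript(paste0("scroll(0, ", j * step_px, ")"))
  }
}

# Inside the while loop, call scroll_page(remDr) just before
# remDr$getPageSource() so the page is fully rendered when it is read.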

Emmanuel Hamel