I am web scraping a website to collect data for research purposes, using RSelenium, Docker, and rvest.
I've built a script that automatically 'clicks' through the pages whose content I want to download. My problem is that the results change each time I run this script: the number of observations of the variable I'm interested in varies. It concerns about 50,000 observations in total; when running the script several times, the total number of observations differs by a few hundred.
I suspect it has something to do with the internet connection being too slow, or with the website not loading quickly enough. When I change the Sys.sleep(2) value the results change too, but with no clear pattern as to whether higher values make it better or worse.
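One thing I've considered (but haven't verified) is replacing the fixed sleep with an explicit wait that only continues once the page source has actually changed after the click. A minimal sketch of what I mean; wait_for_new_page is my own hypothetical helper, and the timeout/poll values are guesses:

wait_for_new_page <- function(remDr, old_source, timeout = 10, poll = 0.5) {
  # poll until the page source differs from the one captured before
  # the click, or give up after 'timeout' seconds
  waited <- 0
  while (waited < timeout) {
    if (!identical(remDr$getPageSource()[[1]], old_source)) return(invisible(TRUE))
    Sys.sleep(poll)
    waited <- waited + poll
  }
  warning("page did not change within ", timeout, " seconds")
  invisible(FALSE)
}

# usage inside the loop would be something like:
# old_source <- remDr$getPageSource()[[1]]
# element_next_page$sendKeysToElement(list(key = "enter"))
# wait_for_new_page(remDr, old_source)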
In the terminal I run:
docker run -d -p 4445:4444 selenium/standalone-chrome
Then my code looks something like this:
library(rvest) # for read_html(), html_nodes(), html_text(), and the %>% pipe

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("url of website")

pages <- 100 # for example, I want information from the first hundred pages
variable <- vector("list", pages)

i <- 1
while (i <= pages) {
  # scrape the information I want from the current page
  variable[[i]] <- remDr$getPageSource()[[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("node that indicates the information I want") %>%
    html_text()

  # find the 'next page' button and press enter to go to the next page
  element_next_page <- remDr$findElement(using = "css selector",
                                         "node that indicates the 'next page' button")
  element_next_page$sendKeysToElement(list(key = "enter"))

  Sys.sleep(2) # I believe this is done to not overload the website I'm scraping
  i <- i + 1
}

variable <- unlist(variable)
Somehow, running this multiple times keeps returning different results in terms of the number of observations that remain after I unlist variable.
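To pin down where the runs diverge, I could record the number of observations per page (before the unlist() step) and compare the counts between two runs. A purely diagnostic sketch, assuming variable is still the list of per-page results:

counts <- sapply(variable, length) # observations scraped from each page
summary(counts)
which(counts == 0)                 # pages that returned nothing, i.e. probably not loaded in time

Pages with a count of 0 would suggest the script scraped before the new page had loaded.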
Has anyone had the same experience, or tips on what to do?
Thanks.