I'm not sure if it's because my internet connection is slow, but I'm trying to scrape a website that loads information as you scroll down the page. I'm running a script that scrolls to the end of the page and then waits for Selenium/Chrome to load the additional content. The page does update and load new content, because I'm able to scrape information that wasn't on the page originally and the new content shows up in the Chrome window, but it only updates once. I added a Sys.sleep() call that waits a full minute on each iteration so the content has plenty of time to load, but it still doesn't update more than once. Am I using RSelenium incorrectly? Are there other ways to scrape a site that loads content dynamically?
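One idea I had, instead of a fixed Sys.sleep(), was to poll until the number of loaded items grows. This is only a rough, untested sketch; wait_for_more_items is just a name I made up, the '.item-title a' selector is the same one I scrape with further down, and the one-second interval and 30-second cap are arbitrary guesses on my part:

# Rough sketch: poll until more '.item-title a' nodes appear, or give up
# after `timeout` seconds. Selector, interval, and timeout are all guesses.
wait_for_more_items <- function(remDr, selector = '.item-title a', timeout = 30) {
  n_before <- length(remDr$findElements(using = 'css', selector))
  for (s in seq_len(timeout)) {
    Sys.sleep(1)
    n_now <- length(remDr$findElements(using = 'css', selector))
    if (n_now > n_before) return(TRUE)  # new items arrived
  }
  FALSE  # timed out without any new items
}

The idea would be to call wait_for_more_items(remDr) right after each scroll instead of Sys.sleep(60), so the loop moves on as soon as new katas actually appear, but I don't know whether that addresses the underlying problem.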
Anyway, any kind of advice or help you can provide would be awesome.
Below is what I think is the relevant portion of my code, the part that scrolls to the end of the page and waits for the new content to load:
for (i in 1:3) {
  webElem <- remDr$findElement(using = 'css', 'body')
  # scroll to the bottom of the page, then give the new content time to load
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}
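I also wondered whether sending the End key to the body element would behave any differently from executing JavaScript, though I haven't tested this variant against the page; the five-second wait is just for illustration:

# Variant: scroll by sending the End key to <body> instead of running JavaScript
webElem <- remDr$findElement(using = 'css', 'body')
for (i in 1:3) {
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(5)
}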
Below is the full code:
library(RSelenium)
library(rvest)
library(stringr)
# rsDriver() starts the Selenium server and returns an already-open client,
# so there's no need to create and open a second remoteDriver
rD <- rsDriver(port = 4444L, browser = 'chrome')
remDr <- rD$client
remDr$navigate('http://www.codewars.com/kata')
# find the total number of recorded katas
tot_kata <- remDr$findElement(using = 'css', '.is-gray-text')$getElementText() %>%
  unlist() %>%
  str_extract('\\d+') %>%
  as.numeric()

# there are about 30 katas per page load
tot_pages <- (tot_kata / 30) %>%
  ceiling()
# will be 1:tot_pages once I know the code below works
for (i in 1:3) {
  webElem <- remDr$findElement(using = 'css', 'body')
  # scroll to the bottom of the page, then give the new content time to load
  remDr$executeScript('window.scrollTo(0, document.body.scrollHeight);')
  Sys.sleep(60)
}
# pull the accumulated page source and extract the kata slugs
page_source <- remDr$getPageSource()
kata_vector <- read_html(page_source[[1]]) %>%
  html_nodes('.item-title a') %>%
  html_attr('href') %>%
  str_replace('/kata/', '')

# close the browser session and stop the Selenium server
remDr$close()
rD$server$stop()