I am currently trying to scrape a website with a combination of RSelenium, rvest, and the tidyverse.
The goal is to go to this website (https://www.pricecharting.com/category/pokemon-cards), click on one of the links (for instance, "Promo"), and then extract the entire table of data (e.g., card and graded prices) using rvest.
I was able to get the table extracted without too much of an issue using the following code:
library(RSelenium)
library(rvest)
library(tidyverse)
## read the static HTML of the Promo set page and pull the prices table
pokemon <- read_html("https://www.pricecharting.com/console/pokemon-promo")
price_table <- pokemon %>%
  html_elements("#games_table") %>%
  html_table()
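Since html_table() returns a list of tibbles (one per matched node), I just grab the first element when I want the table itself (the name promo_prices is mine):
## html_table() gives back a list of tibbles; the table itself is the first element
promo_prices <- price_table[[1]]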
However, this has a couple of issues: 1) I cannot go through all the different card sets on the initial website link I provided (https://www.pricecharting.com/category/pokemon-cards), and 2) I cannot extract the entire table with this method - only what is initially loaded.
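(For the first issue, I imagine I could collect the link to each set straight from the category page with rvest, something like the sketch below - the "/console/pokemon" filter is just a guess on my part based on the Promo URL above - but that still would not solve the partial-table problem.)
## sketch: grab the href of every set link on the category page
category <- read_html("https://www.pricecharting.com/category/pokemon-cards")
set_links <- category %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("/console/pokemon")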
To mitigate these issues I was looking into RSelenium. What I decided to do was go to the initial website, click on the link to a card set (e.g., "Promo"), and then load the entire page. This workflow is shown here:
## open driver
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD[["client"]]
## navigate to primary page
remDr$navigate("https://www.pricecharting.com/category/pokemon-cards")
## click on the link I want
remDr$findElement(using = "link text", "Promo")$clickElement()
## find the table
table <- remDr$findElement(using = "id", "games_table")
## load the entire table
table$sendKeysToElement(list(key = "end"))
## get the entire source
full_table <- remDr$getPageSource()[[1]]
## read in the table
html_page <- read_html(full_table)
## Do the `rvest` technique I had above.
html_page %>%
  html_elements("#games_table") %>%
  html_table()
However, my issue is that I am once again getting the same 51 elements instead of the entire table.
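My guess is that the rows are lazy-loaded as the page is scrolled, so the next thing I was planning to try is a scroll loop along these lines (untested sketch - the "#games_table tr" selector and the two-second wait are assumptions on my part):
## keep scrolling to the bottom until the number of table rows stops growing
old_n <- 0
repeat {
  ## dummy arg avoids the empty-args quirk in some RSelenium versions
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);", args = list(1))
  Sys.sleep(2)
  rows <- remDr$findElements(using = "css selector", "#games_table tr")
  if (length(rows) == old_n) break
  old_n <- length(rows)
}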
I am wondering whether it is possible to combine my two techniques, and where my code is going wrong.