Scrape dynamically-generated webpage in R without needing to download Docker

Question

I am trying to scrape a table of legislators from the following website: https://www.legis.ga.gov/members/house

First I tried rvest, but that did not work because the page is dynamically generated.

library(rvest)
url <- 'https://www.legis.ga.gov/members/house'
page = read_html(url)
page %>% 
  html_element("table") %>%
  html_table()

#Error in View : no applicable method for 'html_table' applied to an object of class "xml_missing"

Then I tried RSelenium. That did not work because it could not determine server status.

library(RSelenium)
rD = rsDriver(browser="chrome", port=4234L, chromever="109.0.5414.74")

#Warning message:
#In rsDriver(browser = "chrome", port = 4234L, chromever = "109.0.5414.74") :
#  Could not determine server status.

library(wdman)
selServ <- wdman::selenium(verbose = FALSE)
selServ$log()

#$stderr
#[1] ""
#
#$stdout
#[1] ""

Then I tried to install the splashr package. I got this warning:

"Warning in install.packages : package ‘splashr’ is not available for this version of R"

Looking at other Stack Overflow questions, several suggest downloading something called Docker (e.g., How to set up rselenium for R?). But it looks like this would involve launching Docker and going through several complicated steps each time I need to scrape something. It doesn't make sense to go through all those steps to scrape a single table. I'm also leery of downloading software if it's not necessary. What is the simplest way to scrape this table? Am I missing something obvious that I was supposed to do?

For reference: I am using Chrome version 109.0.5414.119 on macOS Ventura, R Version 4.2.0.

Not sure if you are using RStudio or not, but I get that message a ton within RStudio. When I attempt the same installation outside of RStudio, it works fine. Then I go back into RStudio, call library(mypackage), and it works fine... — cory, Feb 07 '23 at 13:24
Here's a fairly thorough blog post of a step-by-step with rselenium: https://appsilon.com/webscraping-dynamic-websites-with-r/ — Paul Stafford Allen, Feb 07 '23 at 13:32
I am working in RStudio. Are you referring to the installation for splashr? It looks like splashr was removed from the CRAN repository (https://cran.r-project.org/web/packages/splashr/index.html), so does it make sense to try to install it another way? Not sure how I would do that. The steps in the blog post don't work for me because I still get the "Could not determine server status" error when I try to specify a Chrome version. — user3710004, Feb 07 '23 at 15:54

score 1 · Answer 1 · answered Feb 07 '23 at 19:11

Sniff their API from the network section of your browser's developer tools and call it with httr2:

library(httr2)
library(tidyverse)

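# API endpoint observed in the browser's network section
"https://www.legis.ga.gov/api/members/list/1031?chamber=1" %>%
  request() %>%
  # Bearer token copied from the browser's request headers; it is short-lived,
  # so copy a fresh one before running this
  req_auth_bearer_token("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJuYmYiOjE2NzU3OTY4NDgsImV4cCI6MTY3NTc5NzE0OCwiaWF0IjoxNjc1Nzk2ODQ4fQ.3pxxIurHe8uPgXGY0DZay0wAUk8Lf5rbHsGvXiNbYMY") %>%
  req_perform() %>%
  # Parse the JSON body into a data frame, then convert it to a tibble
  resp_body_json(simplifyVector = TRUE) %>%
  as_tibble()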

# A tibble: 178 × 13
   district$id party photos lastName  districtNumber city     fullName sessionId    id name$first
         <int> <int> <list> <chr>              <int> <chr>    <chr>        <int> <int> <chr>     
 1          59     0 <df>   Adesanya              43 Marietta Solomon…      1031  5025 Solomon   
 2         158     0 <df>   Adeyina              110 Grayson  Segun A…      1031  5044 Segun     
 3          99     0 <df>   Alexander             66 Hiram    Kimberl…      1031   806 Kimberly  
 4          12     1 <df>   Anderson              10 Cornelia Victor …      1031  4989 Victor    
 5          54     0 <df>   Anulewicz             42 Smyrna   Teri An…      1031  4915 Teri      
 6          72     0 <df>   Au                    50 Johns C… Michell…      1031  4983 Michelle  
 7         203     1 <df>   Ballard              147 Warner … Bethany…      1031  5051 Bethany   
 8          29     1 <df>   Ballinger             23 Canton   Mandi B…      1031   808 Mandi     
 9         132     0 <df>   Barnes                86 Tucker   Imani B…      1031  5037 Imani     
10          30     1 <df>   Barrett               24 Cumming  Carter …      1031  5020 Carter    
# … with 168 more rows, and 11 more variables: district$number <int>, $chamberType <int>,
#   $suffix <chr>, name$last <chr>, $middle <chr>, $nickname <chr>, $suffix <chr>,
#   $familyName <chr>, districtAddress <df[,8]>, residence <chr>, dateVacated <lgl>
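
A note on the token: the middle segment of a JWT like the one above is base64url-encoded JSON, so its validity window can be inspected locally. The following is only a sketch, assuming the jsonlite package is installed; the variable names are illustrative. It does not verify the signature or fetch a new token, it just reads the iat and exp claims to show how long the token was valid.

```r
library(jsonlite)

# Token copied verbatim from the request above
token <- "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJuYmYiOjE2NzU3OTY4NDgsImV4cCI6MTY3NTc5NzE0OCwiaWF0IjoxNjc1Nzk2ODQ4fQ.3pxxIurHe8uPgXGY0DZay0wAUk8Lf5rbHsGvXiNbYMY"

# A JWT has three dot-separated parts: header, payload, signature
payload_b64 <- strsplit(token, ".", fixed = TRUE)[[1]][2]

# Decode the base64url payload and parse the JSON claims
claims <- fromJSON(rawToChar(base64url_dec(payload_b64)))

# Lifetime in seconds, and the expiry time in UTC
lifetime_secs <- claims$exp - claims$iat
expires_at <- as.POSIXct(claims$exp, origin = "1970-01-01", tz = "UTC")

print(lifetime_secs)
print(expires_at)
```

If the lifetime turns out to be short, a token pasted from an answer will have expired by the time someone else runs the code, and a fresh one has to be copied from the browser's request headers.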

This is very helpful, thank you! I right-clicked on the API page and saved it as georgia_house.json. Then the only code I had to write was: library(jsonlite); house = fromJSON("./data/georgia_house.json"). However, using httr2 didn't work for me. How do I get a req_auth_bearer_token? — user3710004, Feb 07 '23 at 23:08
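
The saved-file approach from the comment above can be sketched as follows. This is a minimal illustration, not the API's exact output: the two records are made-up placeholders that only mimic some of the field names visible in the tibble printout, and flatten = TRUE is an optional choice that spreads nested objects such as name into ordinary columns.

```r
library(jsonlite)

# Hypothetical sample standing in for the saved file; the real data would be
# read with fromJSON("./data/georgia_house.json", flatten = TRUE)
sample_json <- '[
  {"id": 1, "fullName": "Jane Doe", "party": 0, "districtNumber": 1,
   "city": "Exampleville", "name": {"first": "Jane", "last": "Doe"}},
  {"id": 2, "fullName": "John Roe", "party": 1, "districtNumber": 2,
   "city": "Sampleton", "name": {"first": "John", "last": "Roe"}}
]'

# flatten = TRUE turns the nested "name" object into name.first / name.last
house <- fromJSON(sample_json, flatten = TRUE)

# Keep only the columns needed for a simple table of legislators
members <- house[, c("fullName", "districtNumber", "city", "party")]
print(members)
```

Without flatten = TRUE, nested fields arrive as data-frame columns inside the data frame, which is what the district$id and name$first headers in the printout above reflect.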

score 0 · Answer 2 · answered Apr 06 '23 at 18:57

It turned out that I did not need to scrape this page at all, because the data was available as JSON in the network section. However, I was able to find out how to get around the RSelenium server status error. I simply had to set the Chrome version (chromever) to NULL.


    driver <- rsDriver(browser = "firefox",
                       chromever = NULL,
                       verbose = FALSE)
