
I have some experience using the rvest package to scrape data I need from the web, but am hitting an issue with this page:

https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html

If you scroll down a bit, you'll see a portion where all the schools are located.


I would like to get the school, cases, and location data. I should note that someone asked on the NYT GitHub about publishing this as a CSV, and the reply was that the data is all in the page and can simply be pulled from there. So I think it is OK to scrape this page.

But I can't get it to work. Let's say I just want to start with a simple selector for that first school. I use the inspector to find the xpath.


I get no results:

library(rvest)

URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
pg <- read_html(URL)

# xpath copied from inspector
xpath_first_school <- '//*[@id="school100663"]'

node_first_school <- html_node(pg, xpath = xpath_first_school)

> node_first_school
{xml_missing}
<NA>

I get {xml_missing}.

I obviously have a lot more to do to generalize this and gather data for all of the schools, but with web scraping I usually try to start simple and specific and then broaden out. Even my simple test isn't working, though. Any ideas?

Nick Criswell
  • I looked at the website, and even within the browser devtools network traffic there does not appear to be a trivial way to get at the data. That is, when `read_html` (or even `html_session`) is called, the school data appears to be absent; it is then filled in at some point in the HTTP conversation. However, none of the `js`, `json`, or `plain` connections appear to bear the data you're seeking, so I suspect that they're being very careful: you can see the data on the web page, but they are making it difficult to scrape programmatically. – r2evans Aug 29 '20 at 22:56
  • While it is cumbersome and not without flaws, I suspect that one way forward (since ha`rvest`ing it "simply" is not working for me) is to try a headless browser setup using `RSelenium`. That's a bit out of my wheelhouse, but perhaps that gives you some ideas. – r2evans Aug 29 '20 at 22:56
  • @r2evans, thank you. [There is a site](https://callumgwtaylor.github.io/blog/2018/02/01/using-rselenium-and-docker-to-webscrape-in-r-using-the-who-snake-database/) that shows how to use `RSelenium` with `rvest` to get this data. I will research that and then perhaps post as an answer. Appreciate your taking a look and confirming that I wasn't just failing at using the simpler approach! – Nick Criswell Aug 29 '20 at 23:00
  • I've been too-quick to recommend `RSelenium` in the past, likely because I hit a threshold of *"couldn't find it with `rvest` in a reasonable amount of time"* ... so don't completely give up on it .. all I did (tbh) was grep all downloads (of those three types) for `Birmingham` (one of the schools, found none) or reasonable-looking URLs that might indicate a non-json data source (looked for `nyt.com`, found nothing but images and news-meta). My guess is that it is slightly obfuscated within `js` code ... good luck, Nick. – r2evans Aug 29 '20 at 23:08
  • r2evans, in case you were wondering: @KKW and I fought Selenium on this for a while and then got clued into a solution which allowed us to read the text of the page and parse some JSON. Not an extremely generalizable solution, but it does work on this problem. – Nick Criswell Aug 31 '20 at 19:21
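
For anyone diagnosing a similar page: before reaching for a headless browser, a quick sanity check is whether the data is embedded somewhere in the static response at all, for example by searching the page text for a known school name. A minimal sketch (the `NYTG_schools` variable name comes from the workaround in the second answer below):

library(rvest)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
page_text <- html_text(read_html(url))

grepl("Birmingham", page_text)    # is a known school name anywhere in the static response?
grepl("NYTG_schools", page_text)  # is the embedded JavaScript data variable present?

If both return TRUE, the data ships with the page inside a script tag and can be extracted without Selenium, which is what the second answer ends up doing.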

2 Answers


Setting up RSelenium can take some time. First you have to download chromedriver (https://chromedriver.chromium.org/), selecting the version closest to your current Chrome. Then unpack it into your R working directory.
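
If you are unsure which driver versions RSelenium can already see on your machine, the binman helper mentioned in the comments below can list them; a minimal check (the versions returned will differ per machine):

# chromedriver versions that binman/RSelenium have downloaded locally
binman::list_versions("chromedriver")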

I tried using a package called decapitated, which can scrape JavaScript-rendered websites, but because this page has a "Show more" button that needs to be physically clicked before all the data are shown, I had to use RSelenium to "click" it, grab the page source, and then use rvest for parsing.

Code:

library(rvest)
library(tidyverse)
library(RSelenium)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

driver <- rsDriver(browser = c("chrome"), chromever = "85.0.4183.87", port = 560L)
remote_driver <- driver[["client"]] 
remote_driver$navigate(url)

showmore <- remote_driver$findElement(using = "xpath", value = "//*[@id=\"showall\"]/p")
showmore$clickElement()

test <- remote_driver$getPageSource()

# school names
school <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[2]/h2") %>%
  html_text() %>%
  as_tibble()

# case counts
case <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[3]/p") %>%
  html_text() %>%
  as_tibble()

# locations
location <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id, \"school\")]/div[4]/p") %>%
  html_text() %>%
  as_tibble()

# drop the first matched element of case and location so their lengths line up with school
combined_table <- bind_cols(school, case = case[2:nrow(case), ], location = location[2:nrow(location), ])
names(combined_table) <- c("school", "case", "location")

combined_table %>% view()

Output:

# A tibble: 913 x 3
   school                                      case  location              
   <chr>                                       <chr> <chr>                 
 1 University of Alabama at Birmingham*        972   Birmingham, Ala.      
 2 University of North Carolina at Chapel Hill 835   Chapel Hill, N.C.     
 3 University of Central Florida               727   Orlando, Fla.         
 4 University of Alabama                       568   Tuscaloosa, Ala.      
 5 Auburn University                           557   Auburn, Ala.          
 6 North Carolina State University             509   Raleigh, N.C.         
 7 University of Georgia                       504   Athens, Ga.           
 8 Texas A&M University                        500   College Station, Texas
 9 University of Texas at Austin               483   Austin, Texas         
10 University of Notre Dame                    473   Notre Dame, Ind.      
# ... with 903 more rows

Hope this works for you!

KKW
  • OK, getting closer. I have used `RSelenium` in the past and it goes something like this (on a Windows machine with Docker): `shell("docker run -d -p 4445:4444 selenium/standalone-chrome"); remdr <- remoteDriver(port = 4445L, browserName = "chrome"); remdr$open(); remdr$navigate("https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html")` etc. When I use this approach, just replacing `remote_driver` from your code with `remdr` in mine, I only get those first 15 schools. When I try the `rsDriver` approach, it errors out and says the versions are misaligned. – Nick Criswell Aug 31 '20 at 00:41
  • The error specifically says, `Selenium message:session not created: This version of ChromeDriver only supports Chrome version 85.` If I change `chromever` to the one for my browser... `rsDriver(browser = c("chrome"), chromever = "84.0.4147.89", port = 560L)` I get: `Error in chrome_ver(chromecheck[["platform"]], chromever) : version requested doesnt match versions available = 85.0.4183.38,85.0.4183.83,85.0.4183.87` Note, when I run `binman::list_versions("chromedriver")`, I get `$win32 [1] "85.0.4183.38" "85.0.4183.83" "85.0.4183.87"`, so I'm not sure why I can't use version 85. – Nick Criswell Aug 31 '20 at 00:43
  • Did you download chromedriver 84 since your browser is version 84? – KKW Aug 31 '20 at 11:13
  • My version of Chrome is 84.0.4147.89. I followed the instructions [here](https://chromedriver.chromium.org/downloads/version-selection) and downloaded the windows [file here](https://chromedriver.storage.googleapis.com/index.html?path=84.0.4147.30/). I run `driver <- rsDriver(browser = c("chrome"), chromever = "84.0.4147.30", port = 560L)` and get the error `Error in chrome_ver(chromecheck[["platform"]], chromever) : version requested doesnt match versions available = 85.0.4183.38,85.0.4183.83,85.0.4183.87` which matches what I see when I run `binman::list_versions("chromedriver")` – Nick Criswell Aug 31 '20 at 12:56
  • Interesting. 1) Do you have another chromedriver.exe in your working directory? 2) Have you tried putting the chromedriver.exe (ver 84) directly into your current working directory? 3) If you've tried all of the above, perhaps update your Chrome to the latest version and try chromedriver 85 and see if that works! Sorry, I've had on-and-off problems with RSelenium when I'm using it. It does take a while to set up. – KKW Aug 31 '20 at 16:35
  • Yes, I did the download and then extracted the zip file into my working directory. There are no other drivers there. This is a work machine so I'm stuck with the version of Chrome I'm on. I appreciate all of your help on this. I am not sure why my normal approach using Firefox/Chrome with `RSelenium::remoteDriver` is not working, but I just came across some Python `BeautifulSoup` code that a colleague put together that seems to work. – Nick Criswell Aug 31 '20 at 17:04
  • Glad to hear you found another way to do that. Python and Selenium work so much better than R and RSelenium. If you still want R for data wrangling and R Markdown for publication, you could always use the reticulate package to combine Python + R. All the best! Do you mind sharing that BeautifulSoup code? I'd love to see how one could get the entire page source without physically "clicking" on the "show more". Thanks! – KKW Aug 31 '20 at 18:02
  • So the solution actually is scraping the json from the site which I don't even really know how to do with `R`. Researching now... `from bs4 import BeautifulSoup import pandas as pd import urllib import json import re url = "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html" page = urllib.request.urlopen(url) soup = BeautifulSoup(page, 'html.parser') finder = re.findall(r'var NYTG_schools = .*;', soup.text) data = str(finder[0]) # Case and point start = data.find(" = ") stop = data.find(";") data = data[start+3:stop] json_data = json.loads(data)` – Nick Criswell Aug 31 '20 at 18:09
  • If you skip the whole Inspector thing and just View Page Source, you can Ctrl+F for "var NYTG_schools" to see where the JSON starts. Not sure how to get that in `R`, but that is going to be what I'll be chasing now that I know the JSON is already exposed without any Selenium/click actions. – Nick Criswell Aug 31 '20 at 18:15
  • So much more elegant than using selenium! – KKW Aug 31 '20 at 18:31
  • Here is a link https://github.com/yusuzech/r-web-scraping-cheat-sheet I found useful but I still can't seem to use V8 or dev tools to do the above – KKW Aug 31 '20 at 18:54
  • I have a method that is uglier than homemade soup which parses the text of the page to get what I need. Not really an answer to the question _per se_ as that was about empty `html_nodes` but it does give me the answer. – Nick Criswell Aug 31 '20 at 19:19

So I am going to provide an answer here which violates a very important rule and is generally an ugly solution, but it lets us avoid having to use Selenium.

To use html_nodes on this page, we need to trigger JS actions, which requires Selenium. @KKW's solution seems to work on their machine, but I can't get chromedriver to work on mine. I can get almost there using Docker with Firefox or Chrome, but can't get the full result. So I would check that solution out first, and if it fails, give this a shot. In short, this site has the data I need exposed as JSON in the page text, so I pull the text of the site, use a regex to isolate the JSON, and then parse it with jsonlite.

library(jsonlite)
library(rvest)
library(tidyverse)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

html_res <- read_html(url)

# get text
text_res <- html_res %>% 
  html_text(trim = TRUE)

# find the area of interest
data1 <- str_extract_all(text_res, "(?<=var NYTG_schools = ).*(?=;)")[[1]]

# get json into data frame
json_res <- fromJSON(data1)

# did it work?
glimpse(json_res)

Rows: 1,515
Columns: 16
$ ipeds_id    <chr> "100663", "199120", "132903", "100751"...
$ nytname     <chr> "University of Alabama at Birmingham",...
$ shortname   <chr> "U.A.B.", "North Carolina", "Central F...
$ city        <chr> "Birmingham", "Chapel Hill", "Orlando"...
$ state       <chr> "Ala.", "N.C.", "Fla.", "Ala.", "Ala."...
$ county      <chr> "Jefferson", "Orange", "Orange", "Tusc...
$ fips        <chr> "01073", "37135", "12095", "01125", "0...
$ lat         <dbl> 33.50199, 35.90491, 28.60258, 33.21402...
$ long        <dbl> -86.80644, -79.04691, -81.20223, -87.5...
$ logo        <chr> "https://static01.nyt.com/newsgraphics...
$ infected    <int> 972, 835, 727, 568, 557, 509, 504, 500...
$ death       <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
$ dateline    <chr> "n", "n", "n", "n", "n", "n", "n", "n"...
$ ranking     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
$ medicalnote <chr> "y", NA, NA, NA, NA, NA, NA, NA, NA, N...
$ coord       <list> [<847052.5, -406444.3>, <1508445.93, ...
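
From here, if you want the same school / case / location table as the RSelenium answer above, a short dplyr sketch (assuming the column names shown in the glimpse() output):

combined_table <- json_res %>%
  transmute(school   = nytname,
            case     = infected,
            location = paste(city, state, sep = ", "))

head(combined_table)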

Nick Criswell
  • this works too. To be less wordy, maybe try... data1 <- str_extract_all(text_res, "(?<=var NYTG_schools =).*(?=;)"). It uses regex lookarounds to match what sits between the search terms without including them. – KKW Aug 31 '20 at 21:09
  • Thank you, that is slicker. Will make the change. – Nick Criswell Aug 31 '20 at 22:58