I have some experience using the rvest
package to scrape data from the web, but I'm hitting an issue with this page:
https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html
If you scroll down a bit, you'll see the section that lists all the schools.
I would like the school, case count, and location data. I should note that someone asked on the NYT GitHub about publishing this as a CSV, and the response was that the data is all in the page and can just be pulled from there. So I think it is OK to scrape this page.
But I can't get it to work. Let's say I just want to start with a simple selector for the first school. I copy the XPath from the browser inspector, but I get no results:
library(rvest)
URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
pg <- read_html(URL)
# xpath copied from inspector
xpath_first_school <- '//*[@id="school100663"]'
node_first_school <- html_node(pg, xpath = xpath_first_school)
> node_first_school
{xml_missing}
<NA>
I just get {xml_missing}.
I obviously have a lot more to do to generalize this and gather data for all the schools, but with web scraping I usually try to start simple and specific and then broaden out. Even my simple test isn't working, though. Any ideas?
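In case it helps, here is the sanity check I can think of (a sketch, untested against the live page): search the raw HTML string for the id before parsing, to see whether the server even sends that node or whether it gets injected later by JavaScript. The helper name id_in_html is my own, and the commented-out readLines call is just one assumed way to fetch the raw source.

```r
# Sketch: does a given element id appear in the served HTML at all?
# If FALSE, the node is built client-side and read_html() will never see it.
id_in_html <- function(html_text, id) {
  grepl(paste0('id="', id, '"'), html_text, fixed = TRUE)
}

# Hypothetical usage (requires a network call, so not run here):
# URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
# raw <- paste(readLines(URL, warn = FALSE), collapse = "\n")
# id_in_html(raw, "school100663")
```

If that comes back FALSE, I assume the fix is to find the data embedded elsewhere in the page source (e.g., a JSON blob in a script tag) rather than querying the rendered DOM.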