This is more of a blog post or tutorial than an SO answer but I can also appreciate the desire to learn and am also working on a book for this very topic and this seems like a gd example.
library(rvest)
library(tidyverse)
We'll start with the top-level page:
pg <- read_html("https://www.eia.gov/naturalgas/archive/petrosystem/petrosysog.html")
Now, we'll use an XPath that only gets us table rows that have state data in them. Compare the XPath expressions to the tags in the HTML and this should make sense. Find all <tr>
s without colspan
attributes and only choose remaining <tr>
s that have both the right class and a link to a state:
states <- html_nodes(pg, xpath=".//tr[td[not(@colspan) and
contains(@class, 'links_normal') and a[@name]]]")
data_frame(
state = html_text(html_nodes(states, xpath=".//td[1]")),
link = html_attr(html_nodes(states, xpath=".//td[2]/a"), "href")
) -> state_tab
It's in a data frame to keep it tidy and handy.
You'll need to put the next bit below the function that comes after it, but I need to explain the iteration before showing the function.
We need to iterate over each link. In each iteration, we:
- pause since your needs aren't more important than EIA's server load
- find all "branch"
<div>
s since they hold two pieces of information we need (the state+year and the data table for said state+year).
- wrap it all up in a nice data frame
Rather than clutter up the anonymous function, we'll put that functionality in another function (again, which needs to be defined before this iterator will work):
pb <- progress_estimated(nrow(state_tab))
map_df(state_tab$link, ~{
pb$tick()$print()
pg <- read_html(sprintf("https://www.eia.gov/naturalgas/archive/petrosystem/%s", .x))
Sys.sleep(5) # scrape responsibly
html_nodes(pg, xpath=".//div[@class='branch']") %>%
map_df(extract_table)
}) -> og_df
This is the hard worker of the bunch. We need to find all the State + Year labels on the page (each are in a <table>
) then we need to find the tables with data in them. I take the liberty of removing the explanatory blurb at the bottom of each and also turn each into a tibble
(but that's just my class preference):
extract_table <- function(pg) {
t1 <- html_nodes(pg, xpath=".//../tr[td[contains(@class, 'SystemTitle')]][1]")
t2 <- html_nodes(pg, xpath=".//table[contains(@summary, 'Report')]")
state_year <- (html_text(t1, trim=TRUE) %>% strsplit(" "))[[1]]
xml_find_first(t2, "td[@colspan]") %>% xml_remove()
html_table(t2, header=FALSE)[[1]] %>%
mutate(state=state_year[1], year=state_year[2]) %>%
tbl_df()
}
Re-pasting the aforeposted code just to ensure you grok it has to come after the function:
pb <- progress_estimated(nrow(state_tab))
map_df(state_tab$link, ~{
pb$tick()$print()
pg <- read_html(sprintf("https://www.eia.gov/naturalgas/archive/petrosystem/%s", .x))
Sys.sleep(5) # scrape responsibly
html_nodes(pg, xpath=".//div[@class='branch']") %>%
map_df(extract_table)
}) -> og_df
And, it works (you said you'd do the final cleanup separately):
glimpse(og_df)
## Observations: 14,028
## Variables: 19
## $ X1 <chr> "", "Prod.RateBracket(BOE/Day)", "0 - 1", "1 - 2", "2 - 4", "4 - 6", "...
## $ X2 <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X3 <chr> "Oil Wells", "# ofOilWells", "26", "19", "61", "61", "47", "36", "250"...
## $ X4 <chr> "Oil Wells", "% ofOilWells", "5.2", "3.8", "12.1", "12.1", "9.3", "7.1...
## $ X5 <chr> "Oil Wells", "AnnualOilProd.(Mbbl)", "4.1", "7.8", "61.6", "104.9", "1...
## $ X6 <chr> "Oil Wells", "% ofOilProd.", "0.1", "0.2", "1.2", "2.1", "2.2", "2.3",...
## $ X7 <chr> "Oil Wells", "OilRateper Well(bbl/Day)", "0.5", "1.4", "3.0", "4.9", "...
## $ X8 <chr> "Oil Wells", "AnnualGasProd.(MMcf)", "1.5", "3.5", "16.5", "19.9", "9....
## $ X9 <chr> "Oil Wells", "GasRateper Well(Mcf/Day)", "0.2", "0.6", "0.8", "0.9", "...
## $ X10 <chr> "", "||||", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|", "|"...
## $ X11 <chr> "Gas Wells", "# ofGasWells", "365", "331", "988", "948", "867", "674",...
## $ X12 <chr> "Gas Wells", "% ofGasWells", "5.9", "5.4", "16.0", "15.4", "14.1", "10...
## $ X13 <chr> "Gas Wells", "AnnualGasProd.(MMcf)", "257.6", "1,044.3", "6,360.6", "1...
## $ X14 <chr> "Gas Wells", "% ofGasProd.", "0.1", "0.4", "2.6", "4.2", "5.3", "5.4",...
## $ X15 <chr> "Gas Wells", "GasRateper Well(Mcf/Day)", "2.2", "9.2", "18.1", "30.0",...
## $ X16 <chr> "Gas Wells", "AnnualOilProd.(Mbbl)", "0.2", "0.6", "1.6", "2.0", "2.4"...
## $ X17 <chr> "Gas Wells", "OilRateper Well(bbl/Day)", "0.0", "0.0", "0.0", "0.0", "...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
## $ year <chr> "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009", "2009"...