To get the results you have to fill in the search form and submit it. You can find the URL and the field names by inspecting the HTML.
library(httr)
library(rvest)

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"
post_data <- list(
  m1year = 'ALL',         # Year
  m1company = '',         # Name of MEWA (starts with)
  m1ein = '',             # EIN
  m1state = 'ALL',        # State of MEWA Headquarters
  m1coverage = 'ALL',     # State(s) where MEWA offers coverage
  m1filingtype = 'ALL',   # Type of filing
  cmdSubmitM1 = 'Search',
  # hidden fields
  auth = 'Y',
  searchtype = 'Q',
  sf = 'EIN',
  so = 'A'
)
Now we can submit the form and collect the links, scraping them with the selector table.table.table-condensed td a.
html <- read_html(POST(url, body = post_data, encode = "form"))
links <- html_nodes(html, 'table.table.table-condensed td a') %>% html_attr("href")
links <- paste0("https://www.askebsa.dol.gov", links)
This produces all the links from the first page.
Inspecting the HTTP traffic, I noticed that the next page is loaded by submitting the same form with some extra fields (m1formid, allfilings, page). We can fetch the remaining pages by incrementing the page value in a loop.
library(httr)
library(rvest)

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"
post_data <- list(
  m1year = 'ALL', m1company = '', m1ein = '', m1state = 'ALL',
  m1coverage = 'ALL', m1filingtype = 'ALL', cmdSubmitM1 = 'Search',
  auth = 'Y', searchtype = 'Q', sf = 'EIN', so = 'A',
  m1formid = '', allfilings = '', page = 1
)

links <- c()
while (TRUE) {
  html <- read_html(POST(url, body = post_data, encode = "form"))
  page_links <- html_nodes(html, 'table.table.table-condensed td a') %>%
    html_attr("href") %>%
    paste0("https://www.askebsa.dol.gov", .)
  links <- c(links, page_links)
  # The pager links live in div.textnorm; the second-to-last one
  # reads 'Last' until we reach the final page.
  last <- html_text(tail(html_nodes(html, 'div.textnorm > a'), n = 2)[1])
  if (last != 'Last') {
    break
  }
  post_data[['page']] <- post_data[['page']] + 1
}
print(links)
print(links)
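Once the loop finishes, you may want to persist the results. A small follow-up sketch (the file name m1_links.txt is my own choice, not anything the site requires):

```r
# Flatten to a plain character vector (a no-op if it already is one),
# drop any duplicates, and write one URL per line.
link_vec <- unique(unlist(links))
writeLines(link_vec, "m1_links.txt")
```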
For the second part of the question, I assume the goal is to select the form items and their values. You could do that by selecting all div.question-inline tags and the code tag that follows each one.
library(rvest)

url <- "https://www.askebsa.dol.gov/mewaview/View/Index/6219"
nodes <- html_nodes(read_html(url), 'div.question-inline, div.question')
data <- list()
for (i in nodes) {
  # The item label is the text node directly inside the div.
  n <- trimws(html_text(html_node(i, xpath = './text()')))
  if (length(html_nodes(i, 'code')) == 0) {
    # Multi-line values (e.g. addresses) sit in a sibling address/code tag.
    text <- html_nodes(i, xpath = '../address/code/text()')
    v <- paste(trimws(html_text(text)), collapse = '\r\n')
  } else {
    v <- html_text(html_nodes(i, 'code'))
  }
  data[[n]] <- v
}
print(data)
This code produces a named list with all the form items, but can be modified to produce a nested list or a more appropriate structure.
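For instance, the flat named list could be reshaped into a two-column data frame (a sketch; the column names item and value are my own choice):

```r
# Collapse any multi-element values into a single string,
# then build one row per form item.
df <- data.frame(
  item  = names(data),
  value = vapply(data, paste, character(1), collapse = "; "),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```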
At this point I must say that I have very little experience with R, so this code is probably not a good coding example. Any tips or other comments are very welcome.