
I'm trying to pull the content on https://www.askebsa.dol.gov/epds/default.asp with either rvest or RSelenium, but I'm not finding guidance for the case where the JavaScript page begins with a search box. It would be great to get all of this content into a simple CSV file.

After that, pulling the data from individual filings like https://www.askebsa.dol.gov/mewaview/View/Index/6219 seems possible, but I'd also appreciate a clean recommendation for how to do that. Thanks.

Anthony Damico

3 Answers


For the first part of the problem, this approach using rvest should work, although I am currently receiving an error in the last step, where it cannot find the required name attribute for the submit button.

Here is my approach -

library(rvest)

# open an html session
web_session <- html_session("https://www.askebsa.dol.gov/epds/default.asp")
# get the search form (the second form on the page)
test_search <- html_form(read_html("https://www.askebsa.dol.gov/epds/default.asp"))[[2]]

# set the required values for fields such as company_name, ein_number etc
# pass that info and submit the form - here i am getting an error 
# it cannot recognize the 'search button' name 
# if that is resolved it should work
set_values(test_search, 'm1company' = "Bend", 'm1ein' = '81-6268978' ) %>%
  submit_form(web_session, ., submit = "cmdSubmitM1") %>%
  read_html(.) -> some_html

If I get time I will try to do some more research and get back to you. I found a couple of tutorials and SO questions on similar topics; they are a bit old but still useful.

For the second part it's easier, since there are no dynamic elements involved. I was able to retrieve all the addresses in the form by using SelectorGadget and copy-pasting all the node names into the html_nodes() function.

# read the page into a parsed html document
test_file_with_address <- read_html("https://www.askebsa.dol.gov/mewaview/View/Index/6219")

# copy-paste all the css node names from selector-gadget and get the text from the html file
test_file_with_address %>%
  html_nodes(".border-top:nth-child(19) code , .border-top:nth-child(18) code , .border-top:nth-child(14) code , .border-top:nth-child(13) code , .border-top:nth-child(12) code , .border-top:nth-child(11) code , .border-top:nth-child(9) code , .section-header+ .border-top code") %>%
  html_text()

[1] "\r\n                Bend Chamber of Commerce Benefit Plan and Trust for Wood Products Employers\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "
 [2] "(541) 382-3221"                                                                                                                                                                                                                
 [3] "81-6268978"                                                                                                                                                                                                                    
 [4] "501"                                                                                                                                                                                                                           
 [5] "\r\n                Bend Chamber of Commerce\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "                                                   
 [6] "(541) 382-3221"                                                                                                                                                                                                                
 [7] "93-0331932"                                                                                                                                                                                                                    
 [8] "\r\n                Katy Brooks\r\n                Bend Chamber of Commerce\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "                    
 [9] "(541) 382-3221"                                                                                                                                                                                                                
[10] "katy@bendchamber.org"                                                                                                                                                                                                          
[11] "\r\n                Deb Oster\r\n                Scott Logging/Scott Transportation\r\n                400 S.W. Bluff Drive, #101\r\n                Bend,  OR  97702\r\n                \r\n                "                 
[12] "(541) 536-3636"                                                                                                                                                                                                                
[13] "debo@scotttransport.com"                                                                                                                                                                                                       
[14] "\r\n                Karen Gibbons\r\n                Allen & Gibbons Logging\r\n                P.O. Box 754\r\n                Canyonville,  OR  97417\r\n                \r\n                "                               
[15] "(541) 839-4294"                                                                                                                                                                                                                
[16] "agibbonslog@frontiernet.net"                                                                                                                                                                                                   
[17] "\r\n                Cascade East Benefits\r\n                dba Johnson Benefit Planning\r\n                777 N.W. Wall Street, Suite 100\r\n                Bend,  OR  97703\r\n                \r\n                "      
[18] "(541) 382-3571"                                                                                                                                                                                                                
[19] "del@johnsonbenefitplanning.com"                                                                                                                                                                                                
[20] "93-1130374"                                                                                                                                                                                                                    
[21] "\r\n                PacificSource Health Plans\r\n                P.O. Box 7068\r\n                Springfield,  OR  97475-0068\r\n                \r\n                "                                                       
[22] "(541) 686-1242"                                                                                                                                                                                                                
[23] "george.sherwood@pacifcsource.com"                                                                                                                                                                                              
[24] "93-0245545"                                                                                                                                                                                                                    
[25] "\r\n                PacificSource Health Plans\r\n                P.O. Box 7068\r\n                Springfield,  OR  97475-0068\r\n                \r\n                "                                                       
[26] "(541) 686-1242"                                                                                                                                                                                                                
[27] "george.sherwood@pacificsource.com"                                                                                                                                                                                             
[28] "93-0245545"                                                                                                                                                                                                                    
[29] "N/A"

This requires some more regex magic to clean up and get the values into a data.frame, but the basic building blocks are there; a rough sketch of that cleanup follows.
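
As an illustration of that cleanup step (a sketch only, reusing one of the selectors above; stringr is an extra dependency I'm assuming here, and the exact line order will vary by filing):

library(stringr)

# grab the raw strings again, split each one on the embedded line breaks,
# trim the whitespace padding, and drop the empty pieces
raw_values <- test_file_with_address %>%
  html_nodes(".section-header+ .border-top code") %>%
  html_text()

cleaned <- lapply(raw_values, function(x) {
  parts <- str_trim(str_split(x, "\r\n")[[1]])
  parts[parts != ""]
})

# e.g. cleaned[[1]] should now hold the name / street / city lines of the first block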

Suhas Hegde
  • Hi, thanks so much for taking the time! I am hitting server errors when I try this (on two separate Windows machines). I tried switching over to `httr::RETRY("read_html",...)` but then `html_form` doesn't work. For the second step, not all detail pages have the same structure, unfortunately. I had thought about parsing with `strsplit(...,"\\r\\n")` but it's inconsistent. – Anthony Damico Aug 13 '18 at 13:28
  • I ran this from a Mac (High Sierra 10.13.6) using R 3.5.0. I can send the request to the website without any issues, but the problem is mimicking the "click" action for the search button. If the detail pages differ, I am unsure how you can handle them with just a single function. – Suhas Hegde Aug 13 '18 at 17:16

Here is an example use of RSelenium to get links to individual filings. The rest should be straightforward once you retrieve the links: you can navigate to these URLs using rvest (as you already did) and parse the content with the help of string-manipulation tools such as stringr (a rough sketch follows the example output below). For the second part, it would be optimistic to expect a systematic structure across all forms, so please spend some time constructing specific regular expressions to pull what you need from the retrieved text.

The code below may not be the most efficient solution to your problem, but it includes the right RSelenium concepts and ideas. Feel free to tweak it based on your needs.

Additional info: RSelenium: Basics

# devtools::install_github("ropensci/RSelenium")
library(RSelenium)

# launch a remote driver 
driver <- rsDriver(browser=c("chrome"))
remDr <- driver[["client"]]

# select an URL
url <- "https://www.askebsa.dol.gov/epds/default.asp"

# navigate to the URL
remDr$navigate(url)

# choose year - option[2] corresponds to 2017
year <- remDr$findElements(using = 'xpath',  '//*[@id="m1year"]/option[2]')
year[[1]]$clickElement()

# choose company
company <- remDr$findElements(using = 'xpath',  '//*[@id="m1company"]')
company[[1]]$sendKeysToElement(list("Bend"))

# enter ein
ein <- remDr$findElements(using = 'xpath',  '//*[@id="m1ein"]')
ein[[1]]$sendKeysToElement(list("81-6268978"))

# submit the form to get the results
submit <- remDr$findElements(using = 'xpath',  '//*[@id="cmdSubmitM1"]')
submit[[1]]$clickElement()

# get the total number of results
num_of_results <- remDr$findElements(using = 'xpath',  '//*[@id="block-system-main"]/div/div/div/div/div/div[1]/form/table[1]/tbody/tr/td/div/b[1]')
n <- as.integer(num_of_results[[1]]$getElementText()[[1]])

# loop through results and print the links
for(i in 1:n) {
  xpath <- paste0('//*[@id="block-system-main"]/div/div/div/div/div/div[1]/form/table[3]/tbody/tr[', i + 1, ']/td[1]/a')
  link <- remDr$findElements('xpath', xpath)
  print(link[[1]]$getElementAttribute('href'))
}

# [[1]]
# [1] "https://www.askebsa.dol.gov/mewaview/View/Index/5589"
# 
# [[1]]
# [1] "https://www.askebsa.dol.gov/mewaview/View/Index/6219"

Please note that if you don't narrow down your search, you will get more than 50 results and therefore more than one page of results. In that case you would need additional adjustments to the code: the structure of the XPath inside the for loop may change, you may need to navigate to extra pages, the loop should be limited to 50 iterations per page, and so on. One possible way to page through the results is sketched below.
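
This is only a sketch: the XPath for the pagination button (`//*[@alt="Next"]`) is an assumption mentioned in the comments that I have not verified on every results page, and the link XPath generalizes the one used in the loop above.

# collect the links on the current page, then click "Next" until that button disappears
all_links <- character(0)
repeat {
  rows <- remDr$findElements(using = 'xpath', '//*[@id="block-system-main"]/div/div/div/div/div/div[1]/form/table[3]/tbody/tr/td[1]/a')
  all_links <- c(all_links, vapply(rows, function(e) e$getElementAttribute('href')[[1]], character(1)))
  next_btn <- remDr$findElements(using = 'xpath', '//*[@alt="Next"]')  # assumed locator for the "Next" button
  if (length(next_btn) == 0) break
  next_btn[[1]]$clickElement()
  Sys.sleep(1)  # give the next results page time to render
}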

I believe this covers your actual problem, which was dynamic scraping. You may want to post your follow-up questions separately, since they involve different concepts. There are a lot of regex experts out there who would help you parse those forms, as long as you raise that specific issue in a different question with suitable tags.

OzanStats
  • Hi, sorry, to clarify: I definitely need more than the first 50 results; I think I need a solution that iterates through all search results. Any idea how to paginate? Thanks for your time – Anthony Damico Aug 14 '18 at 12:29
  • You can find the pagination button and click it using the same strategy, if you understand this code and take a look at the documentation. – OzanStats Aug 14 '18 at 15:18
  • Hi, sorry, when I remove the year, company, and EIN filters, the code fails. Pagination would be `submit <- remDr$findElements(using = 'xpath', '//*[@alt="Next"]') ; submit[[1]]$clickElement()`, but the XPath seems to be thrown off by a broader search pattern? Thanks again for the help – Anthony Damico Aug 14 '18 at 18:51

In order to get the results you'll have to fill in the form and submit it. You can find the URL and field names by inspecting the HTML.

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"

post_data <- list(
    m1year = 'ALL',         # Year
    m1company = '',         # Name of MEWA (starts with)
    m1ein = '',             # EIN
    m1state = 'ALL',        # State of MEWA Headquarters
    m1coverage = 'ALL',     # State(s) where MEWA offers coverage
    m1filingtype = 'ALL',   # Type of filing
    cmdSubmitM1 = 'Search',
    # hidden fields
    auth = 'Y', 
    searchtype = 'Q', 
    sf = 'EIN', 
    so = 'A'
)

Now we can submit the form and collect the links, which we can scrape with the selector `table.table.table-condensed td a`.

library(httr)
library(rvest)

html <- read_html(POST(url, body = post_data, encode = "form"))
links <- html_nodes(html, 'table.table.table-condensed td a') %>% html_attr("href")
links <- paste0("https://www.askebsa.dol.gov", links)

This produces all the links of the first page.

Inspecting the HTTP traffic I noticed that the next page is loaded by submitting the same form with some extra fields (m1formid, allfilings, page). We can get the next pages by increasing the page value in a loop.

library(httr)
library(rvest)

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"
post_data <- list(
    m1year='ALL', m1company='', m1ein='', m1state='all', 
    m1coverage='all', m1filingtype='ALL', cmdSubmitM1 = 'Search',
    auth='Y', searchtype='Q', sf='EIN', so='A', 
    m1formid='', allfilings='', page=1
)
links = list()

while (TRUE) {
    html <- read_html(POST(url, body = post_data, encode = "form"))
    page_links <- html_nodes(html, 'table.table.table-condensed td a') %>% html_attr("href") %>% paste0("https://www.askebsa.dol.gov", .)
    links <- c(links, page_links)
    # check the second-to-last pagination link; when its text is no longer 'Last', we are on the final page
    last <- html_text(tail(html_nodes(html, 'div.textnorm > a'), n=2)[1])
    if (last != 'Last') {
        break
    }
    post_data['page'] <- post_data[['page']] + 1
}

print(links)

For the second part of the question, I assume that the goal is to select the form items and their values. You can do that by selecting all div.question-inline and div.question tags and then the code tag(s) that belong to each item.

library(rvest)

url <- "https://www.askebsa.dol.gov/mewaview/View/Index/6219"
nodes <- html_nodes(read_html(url), 'div.question-inline, div.question')
data <- list()

for (i in nodes) {
    # the question label is the text directly inside the div
    n <- trimws(html_text(html_node(i, xpath = './text()')))

    if (length(html_nodes(i, 'code')) == 0) {
        # address blocks keep their values in <code> inside a sibling <address> tag
        text <- html_nodes(i, xpath = '../address/code/text()')
        v <- paste(trimws(html_text(text)), collapse = '\r\n')
    } else {
        v <- html_text(html_nodes(i, 'code'))
    }
    data[[n]] <- v
}

print(data)

This code produces a named list with all the form items, but it can be modified to produce a nested list or a more appropriate structure; a rough sketch of flattening it to CSV follows.
At this point I must say that I have very little experience with R, so this code is probably not a good coding example. Any tips or other comments are very welcome.
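
If the end goal is the CSV from the original question, one rough way to flatten the result (a sketch, assuming each element of `data` is a character vector; the file name is just an example) would be:

# collapse each item to a single string and write a two-column CSV: item label, value
flat <- vapply(data, function(v) paste(trimws(v), collapse = " | "), character(1))
out <- data.frame(item = names(flat), value = unname(flat), stringsAsFactors = FALSE)
write.csv(out, "filing_6219.csv", row.names = FALSE)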

t.m.adam