I am trying to scrape public health statistics from https://prog.nfz.gov.pl/APP-JGP/KatalogJGP.aspx
The page shows two dependent dropdown lists. The page source contains only the options for the first dropdown; the second is filled dynamically (the page is ASP.NET) only after I choose a value in the first. Only after I have made a selection in both do I see the list of links I would like to scrape.
My semi-automatic solution is to choose a value in both dropdowns, save the page, and then scrape the list of links offline. My current code shows that I know how to get the data from the combo boxes and from the page once the selections are made:
# https://prog.nfz.gov.pl/app-jgp/
# In the combo boxes I choose year 2015, kat Ia, and
# a group of procedures, e.g. A1 - xxxxxx.
# Then I save the page locally to C:/myPage/
library(tidyverse)
library(rvest)
# Extract the digits that follow `myString` and one non-digit separator
# (case-insensitive match).
get_id <- function(x, myString) {
  stringr::str_extract(x, paste0("(?i)(?<=", myString, "\\D)\\d+"))
}
# url <- "https://prog.nfz.gov.pl/APP-JGP/KatalogJGP.aspx"
url <- "C:/myPage/Narodowy Fundusz Zdrowia.htm"
pg <- read_html(url)
# The code below gets me the list from the first combo-box
# ContentPlaceHolder2_ddlSymylacjaJGP
n_values <-
  html_nodes(pg, "#ContentPlaceHolder2_ddlSymylacjaJGP option") %>%
  html_attr("value")
n_descr <-
  html_nodes(pg, "#ContentPlaceHolder2_ddlSymylacjaJGP option") %>%
  html_text()
selCatYear <- tibble(n_values, n_descr)
# The code below gets me the list from the second combo-box
# ContentPlaceHolder2_ddlKatalogJGP
n_values <-
  html_nodes(pg, "#ContentPlaceHolder2_ddlKatalogJGP option") %>%
  html_attr("value")
n_descr <-
  html_nodes(pg, "#ContentPlaceHolder2_ddlKatalogJGP option") %>%
  html_text()
setCatJGP <- tibble(n_values, n_descr)
# The code below gets me the list of urls I want.
# ContentPlaceHolder2_blKatalogJGP
n_values <-
  html_nodes(pg, "#ContentPlaceHolder2_blKatalogJGP li a") %>%
  html_attr("href")
n_descr <-
  html_nodes(pg, "#ContentPlaceHolder2_blKatalogJGP li a") %>%
  html_text()
setSingleJGP <- tibble(n_values, n_descr)
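The three blocks above repeat the same pattern (select nodes, read one attribute, read the text), so they could probably be folded into one helper. A minimal sketch, assuming a hypothetical helper name `nodes_to_tibble` and a small inline HTML fragment with made-up values standing in for the saved page:

```r
library(rvest)
library(tibble)

# Hypothetical helper (not part of rvest): collect one attribute and the
# text of every node matching a CSS selector into a two-column tibble.
nodes_to_tibble <- function(page, css, attr = "value") {
  nodes <- html_nodes(page, css)
  tibble(
    n_values = html_attr(nodes, attr),
    n_descr  = html_text(nodes)
  )
}

# Inline stand-in for the saved page; the option values, labels and the
# link are invented for illustration only.
demo <- read_html(paste0(
  '<select id="ContentPlaceHolder2_ddlKatalogJGP">',
  '<option value="1">A</option>',
  '<option value="2">B</option>',
  '</select>',
  '<ul id="ContentPlaceHolder2_blKatalogJGP">',
  '<li><a href="x.aspx?id=1">First</a></li>',
  '</ul>'
))

# Dropdown options: the "value" attribute plus the visible label.
setCatJGP <- nodes_to_tibble(demo, "#ContentPlaceHolder2_ddlKatalogJGP option")

# Links: the "href" attribute plus the link text.
setSingleJGP <- nodes_to_tibble(demo, "#ContentPlaceHolder2_blKatalogJGP li a",
                                attr = "href")
```

With the real saved page, `pg` would be passed in place of `demo`.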
Question: How can I automate the selection in the two combo boxes (dependent, with content created dynamically) so that I get the lists of links? Is this possible with just R and rvest, or is another tool needed? (I have heard of Selenium but have never used it.)