
I have been trying to scrape a table for mapping analysis of facilities around the country. However, I can't seem to manage to extract the data.

I have tried the code below and realized there is no HTML table available on this website.

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page="

library(rvest)

table <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="views-form-resource-guide-results-page-1-results"]/div[1]')

I am not sure if I am using the proper class in the XPath, as I am getting an empty result. If I could also receive some guidance on iterating through all the pages of info, I would greatly appreciate it.
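For reference, counting the matched nodes shows the XPath finds nothing:

nodes <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="views-form-resource-guide-results-page-1-results"]/div[1]')

length(nodes)  # 0 means the XPath matched nothing on the page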

neuroneil
  • You are selecting only the first "row" (Applied Behavior Analysis, Fullerton, CA). What would your desired output look like? – QHarr Jul 11 '19 at 19:26
  • @Qharr My desired output would include all of the rows including the data on the next pages. – neuroneil Jul 11 '19 at 19:35

1 Answer


I am new to R, but something like the following should work: define a function that retrieves the row info as a data frame from a given url, then loop over however many pages you want, calling the function and merging the returned dfs into one big df. As the nodeLists are not always the same length (e.g. not every listing has a telephone number), you need to test whether each element is present while looping over the rows. I use the method in the answer by alistaire (+ to him).

I am using CSS selectors rather than XPath. You can read about them here.
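For example, the .views-row class selector below picks up each listing; the XPath on the second line is just a roughly equivalent expression I wrote out for comparison (page stands for the parsed document):

library(rvest)

page <- read_html(url)  # url as defined in the code below

# CSS: every element carrying the class "views-row"
rows_css <- html_nodes(page, '.views-row')

# A roughly equivalent XPath expression
rows_xpath <- html_nodes(page, xpath = '//*[contains(concat(" ", @class, " "), " views-row ")]')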

Given the number of possible pages, I would look into using an HTTP session: you get the efficiency of re-using a connection. I use them in other languages; from a quick Google it seems R provides this, for example, with html_session.
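A sketch of what that might look like, re-using the url, pages_to_loop, and selector from the code below (html_session and jump_to are rvest functions; I haven't measured the speed-up myself):

library(rvest)

s <- html_session(paste0(url, 1))       # open one session; the connection is re-used

for (i in seq(1, pages_to_loop)) {
  s    <- jump_to(s, paste0(url, i))    # navigate within the same session
  rows <- html_nodes(s, '.views-row')   # then extract fields as in get_listings()
}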

I would welcome suggestions for improvement and any edits for correcting indentation. I'm learning as I go.

library(rvest)
library(magrittr)
library(purrr)

url <- "https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference?sort=status&direction=desc&page="

# Return one data frame of listings (title, location, telephone) for a page
get_listings <- function(url) {
  df <- read_html(url) %>%
    html_nodes('.views-row') %>%   # one node per listing "row"
    map_df(~list(
      title = html_node(.x, '.service-card__title a') %>% html_text(),
      # collapse newlines and trim surrounding whitespace
      location = trimws(gsub('\n', ' ', html_text(html_node(.x, '.service-card__address')))) %>%
        {if (length(.) == 0) NA else .},   # guard against a missing element
      telephone = html_node(.x, '.service-card__phone') %>% html_text() %>%
        {if (length(.) == 0) NA else .}
    ))
  return(df)
}

pages_to_loop <- 2

for (i in seq(1, pages_to_loop)) {
  new_url <- paste0(url, i)   # paste0 already joins with no separator
  if (i == 1) {
    df <- get_listings(new_url)
  } else {
    new_df <- get_listings(new_url)
    df <- rbind(df, new_df)   # append this page's rows
  }
}
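As an aside, the same loop can be written more compactly with purrr, which row-binds the per-page data frames and avoids growing df inside the loop:

df <- map_dfr(seq(1, pages_to_loop), ~ get_listings(paste0(url, .x)))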
QHarr
  • Appreciate the help. Thanks so much! Will definitely look into the resources you provided, as I really want to master this art. – neuroneil Jul 12 '19 at 17:38