How to scrape data from multiple pages by dynamically updating the url with rvest

Question

I'm trying to extract data from this website. I'm interested in extracting data from draft selections by year. The years range from 1963 to 2018.
There is a common pattern in the URL structure. For instance, the pages are https://www.eliteprospects.com/draft/nhl-entry-draft/2018, https://www.eliteprospects.com/draft/nhl-entry-draft/2017 and so on.

So far, I've been successful in extracting the data for a single year. I've written a custom function wherein, given the input, the scraper will gather the data and present it in a nice looking data frame format.

library(rvest)
library(tidyverse)
library(stringr)
get_draft_data <- function(draft_type, draft_year) {

  # replace the spaces between words in draft type with a '-'
  # (a plain character vector is enough here; no tibble is needed)
  draft_type <- str_replace_all(draft_type, " ", "-")

  # create the page url and read the page
  page <- str_c("https://www.eliteprospects.com/draft/", draft_type, "/", draft_year) %>%
    read_html()

  # Now scrape the team data from the page
  # Extract the team data
  draft_team<- page %>%

    html_nodes(".team") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the player data
  draft_player<- page %>%

    html_nodes("#drafted-players .player") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the seasons data
  draft_season<- page %>%

    html_nodes(".seasons") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Join the data frames together
  all_data <- cbind(draft_team, draft_player, draft_season)

  return(all_data)

} # end function

# Testing the function
draft_data<-get_draft_data("nhl entry draft", 2011)
glimpse(draft_data)
Observations: 212
Variables: 3
$ value <chr> "Team", "Edmonton Oilers", "Colorado Avalanche", "Florida Panth...
$ value <chr> "Player", "Ryan Nugent-Hopkins (F)", "Gabriel Landeskog (F)", "...
$ value <chr> "Seasons", "8", "8", "7", "8", "6", "8", "8", "8", "7", "7", "3...

Problem: How can I write the code so that the year in the page URL is incremented automatically, letting the scraper extract the data for every year and write it to a single data frame?

Note: I've already looked at some related questions (1, 2, 3, 4) but couldn't find a solution.

score 1 · Accepted Answer · answered May 10 '19 at 16:14

I would create a function that scrapes the data for a given year, then bind the rows from all years together.

1. Use paste() to create a dynamic url from the base string and a variable for the year
2. Write the scrape function for the url (Note: You don't have to use html_text(); the data is stored as a table, so it can be extracted directly with html_table())
3. Loop the function over the years using lapply()
4. Combine the data frames in the list using bind_rows()

Below is an example of this process for years 2010 to 2012.

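library(rvest)
library(tidyverse)

# Scrape the draft table for a single year
scrape.draft <- function(year) {

  # build the url for the requested year
  url <- paste("https://www.eliteprospects.com/draft/nhl-entry-draft/", year, sep = "")

  out <- read_html(url) %>%
    html_table(header = TRUE) %>%      # list of all tables found on the page
    .[[2]] %>%                         # keep the second table
    filter(!grepl("ROUND", GP)) %>%    # drop rows whose GP column contains "ROUND"
    mutate(draftYear = year)           # record which year each row came from

  return(out)

}

# Scrape each year, then stack the results into one data frame
temp <- lapply(2010:2012, scrape.draft) %>%
  bind_rows()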
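
For the full 1963 to 2018 range from the question, one failed request would abort the whole lapply() call. The sketch below is one possible way to guard against that, not part of the original answer. The helper name scrape_years, its delay argument and the offline stand-in fake_scrape are all hypothetical; in real use you would pass scrape.draft instead of fake_scrape.

```r
library(dplyr)

# Hypothetical helper (not from the answer): run a per-year scraper over
# several years, skip years that raise an error, and pause between requests.
scrape_years <- function(years, scrape_fn, delay = 1) {
  results <- lapply(years, function(year) {
    Sys.sleep(delay)  # pause before each request
    tryCatch(
      scrape_fn(year),
      error = function(e) {
        message("Skipping ", year, ": ", conditionMessage(e))
        NULL  # mark this year as failed
      }
    )
  })
  # drop the failed years, then stack what is left
  results <- Filter(Negate(is.null), results)
  bind_rows(results)
}

# Offline stand-in for scrape.draft() so the sketch runs without network
# access; it fails on purpose for 2011 to exercise the error handling.
fake_scrape <- function(year) {
  if (year == 2011) stop("page not available")
  data.frame(Team = c("Team A", "Team B"), draftYear = year)
}

demo <- scrape_years(2010:2012, fake_scrape, delay = 0)
```

With the real scraper the call would look like scrape_years(1963:2018, scrape.draft). Note that bind_rows() stops with an error if a column has incompatible types in different years (for example character in one and integer in another), so the columns may need converting to a common type inside the scrape function first.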

Thanks for bringing this **dead question** to life :-) If possible, please **enhance your proposed approach** by adding code to fill missing values with NA. Apart from this trivial issue, the code you proposed works, so +1 for it. And thanks for explaining the code workflow; it'll help others who might find your answer equally useful. - mnm, May 11 '19 at 05:03
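
One way to handle the missing values mentioned in the comment is sketched below. It assumes that missing cells come through as empty or whitespace-only strings after scraping, which has not been checked against the actual pages. The data frame draft and its values are made up for illustration.

```r
library(dplyr)

# Made-up rows standing in for scraped output; empty strings mark missing cells.
draft <- data.frame(
  Team = c("Team A", "Team B", ""),
  Player = c("Player One (F)", " ", "Player Three (D)"),
  stringsAsFactors = FALSE
)

# Trim whitespace in every character column, then turn empty strings into NA.
draft_clean <- draft %>%
  mutate(across(where(is.character), ~ na_if(trimws(.x), "")))
```

The same mutate() step could be added at the end of the pipe inside the scrape function so that each year is cleaned before the rows are bound together.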
