I'm trying to extract draft-selection data, by year, from this website. The years range from 1963 to 2018.
The URLs follow a common pattern, for instance https://www.eliteprospects.com/draft/nhl-entry-draft/2018, https://www.eliteprospects.com/draft/nhl-entry-draft/2017, and so on.

So far, I've been successful in extracting the data for a single year. I've written a custom function that, given a draft type and a year, scrapes the data and returns it as a data frame.

library(rvest)
library(tidyverse)
library(stringr)
get_draft_data<- function(draft_type, draft_year){

  # replace the space between words in draft type with a '-'
  draft_slug <- str_replace_all(draft_type, " ", "-")

  # create page url and read the page
  page <- str_c("https://www.eliteprospects.com/draft/", draft_slug, "/", draft_year) %>%
    read_html()

  # Now scrape the team data from the page
  # Extract the team data
  draft_team<- page %>%

    html_nodes(".team") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the player data
  draft_player<- page %>%

    html_nodes("#drafted-players .player") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Extract the seasons data
  draft_season<- page %>%

    html_nodes(".seasons") %>%
    html_text()%>%
    str_squish() %>%
    as_tibble()

  # Join the data frames together.
  all_data <- cbind(draft_team, draft_player, draft_season)

  return(all_data)

} # end function

# Testing the function
draft_data<-get_draft_data("nhl entry draft", 2011)
glimpse(draft_data)
Observations: 212
Variables: 3
$ value <chr> "Team", "Edmonton Oilers", "Colorado Avalanche", "Florida Panth...
$ value <chr> "Player", "Ryan Nugent-Hopkins (F)", "Gabriel Landeskog (F)", "...
$ value <chr> "Seasons", "8", "8", "7", "8", "6", "8", "8", "8", "7", "7", "3...

Problem: How can I write the code so that the year in the page URL is incremented automatically, letting the scraper extract the data for every year and write it all to one data frame?

Note: I've already looked at some related questions (1, 2, 3, 4) but couldn't find a solution.

mnm

1 Answer

I would just create a function that scrapes the data for a given year, run it for each year, and then bind the resulting rows together.

  1. Use paste() to build the URL dynamically from the base string and a variable for the year
  2. Write the scrape function for that URL (Note: You don't have to use html_text(). The data is stored as a table, so it can be extracted directly with html_table())
  3. Loop the function over the years using lapply()
  4. Combine the data frames in the resulting list using bind_rows()

Below is an example of this process for years 2010 to 2012.

library(rvest)
library(tidyverse)

scrape.draft = function(year){

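  # build the URL for the requested draft year
  url = paste("https://www.eliteprospects.com/draft/nhl-entry-draft/", year, sep = "")

  out = read_html(url) %>%
    html_table(header = TRUE) %>%
    .[[2]] %>%                        # keep the second table returned for the page
    filter(!grepl("ROUND", GP)) %>%   # drop rows whose GP cell contains "ROUND"
    mutate(draftYear = year)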

  return(out)

}

temp = lapply(2010:2012,scrape.draft) %>%
  bind_rows()
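
To cover the full 1963 to 2018 range from the question, the same loop can be wrapped so that one failing year does not abort the whole run. The sketch below is only one way to do it. `scrape_years()` and its `pause` argument are hypothetical names, not part of rvest or dplyr. It also assumes that column types may differ between years, so every column is coerced to character before binding.

```r
library(dplyr)

# Hypothetical helper: calls `scraper` once per year, skips years that raise
# an error, and stacks whatever was scraped successfully.
scrape_years <- function(years, scraper, pause = 1) {
  results <- lapply(years, function(year) {
    Sys.sleep(pause)  # short pause between requests
    tryCatch({
      out <- scraper(year)
      # Assumption: types can differ across years, so coerce every column
      # to character to keep bind_rows() from failing on a type clash
      out[] <- lapply(out, as.character)
      out
    }, error = function(e) {
      message("Skipping ", year, ": ", conditionMessage(e))
      NULL
    })
  })
  # Drop the failed years, then stack; bind_rows() fills columns that are
  # missing from a given year with NA
  bind_rows(Filter(Negate(is.null), results))
}

# Usage, assuming scrape.draft() from above is defined:
# all_drafts <- scrape_years(1963:2018, scrape.draft)
```

Because everything comes back as character, numeric columns would need converting afterwards, for example with readr::type_convert().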
Gru
  • Thanks for bringing this **dead question** to life :-) If possible, please **enhance your proposed approach** by adding code to fill missing values with NA. Apart from this minor issue, the code you proposed works, so +1 for it. Thanks also for explaining the workflow; it'll help others who find your answer useful. – mnm May 11 '19 at 05:03