
I have a series of 9 urls that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= parameter at the end of the link goes from 0 up to 900 (in steps of 100) as you page through to the last page. I would like to loop through each page, scrape its table, and then use rbind to stack the data frames on top of one another in sequence. I have been using rvest and would like to use lapply, since I am more comfortable with that than with for loops.

The question is similar to this one (Harvest (rvest) multiple HTML pages from a list of urls) but different because I would prefer not to have to copy all the links into one vector before running the program. I would like a general solution for how to loop over multiple pages, harvest the data, and create a data frame each time.

The following works for the first page:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

But I would like to repeat this over all pages without having to paste the urls into a vector. I tried the following and it didn't work:

jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page and I imagine it would be easier to put them in a list and then use rbind to stack them.

Any help would be greatly appreciated!

jvalenti
  • do you manage to harvest even the first page? – HubertL Nov 17 '16 at 22:55
  • @HubertL yes just edited the question above. The first chunk of code produces one data frame – jvalenti Nov 17 '16 at 23:03
  • Here is another potential solution: http://stackoverflow.com/questions/39129125/how-to-scrape-all-pages-1-2-3-n-from-a-website-using-r-vest/39131227#39131227 – Dave2e Nov 17 '16 at 23:25
  • In the second version, `site` is a vector of URLs, so this is a dupe. – alistaire Nov 18 '16 at 00:03
  • I would try `download.file` with mode="a" and then read all the data from a single disk file. – IRTFM Nov 18 '16 at 01:05
  • @alistaire I don't think this is a dupe because the other question uses a pre-constructed vector of urls, while this one asks how to perform a similar task without already having all the urls in one place. – jvalenti Nov 21 '16 at 19:55
  • You've already shown how to construct the vector of URLs above; that's not the part that's not working. The part you need is to `lapply`/`purrr::map` across the vector `site`, of which the dupe shows an example, and of [which](http://stackoverflow.com/q/33771265/4497050) [there](http://stackoverflow.com/q/28823270/4497050) [are](http://stackoverflow.com/q/30586480/4497050) [many](http://stackoverflow.com/questions/40140133/scraping-tables-on-multiple-web-pages-with-rvest-in-r) [more](http://stackoverflow.com/q/36683510/4497050) [examples](http://stackoverflow.com/q/38114066/4497050). – alistaire Nov 21 '16 at 20:32

2 Answers


You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() reads one page per call and expects a single URL (a scalar value), not a vector of them. Consider looping through the vector of site URLs with lapply, then binding all the data frames together:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    html_table(draft_table)[[1]]      # return the parsed table as a data frame
})

finaldf <- do.call(rbind, dfList)             # ASSUMING ALL DFs MAINTAIN SAME COLS
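
If the columns don't line up identically on every page, do.call(rbind, ...) will throw an error. A minimal alternative sketch, assuming the dplyr package is installed and that each page's table has unique column names, lets bind_rows() match columns by name and fill any gaps with NA:

library(dplyr)

# bind_rows() matches columns by name across the list of data frames
# and fills any columns missing from a given page with NA
finaldf <- bind_rows(dfList)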
Parfait

You can use curl to queue all of the requests and run them at once. Be nice to sites that may have small servers and don't blow them up with traffic. With this code you can use the lapply at the end to clean up the tables so you can stack them with do.call(rbind, AllOut), but I will leave that to you (a rough sketch of that step follows the code).

library(rvest)
library(stringr)
library(tidyr)

OffSet <- seq(0, 900, by = 100)

Sites <- paste0('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', OffSet)


library(curl)

out <- list()
# Callback run each time a request finishes successfully;
# it appends the response object to the list `out`
complete <- function(res){
  # cat("Request done! Status:", res$status, "\n")
  out <<- c(out, list(res))
}

# Queue one GET request per page; `done` and `fail` are the callbacks
# that run when each request completes or errors
for(i in seq_along(Sites)){
  curl_fetch_multi(
    Sites[i]
    , done = complete
    , fail = print
    , handle = new_handle(customrequest = "GET")
    )
}

# Perform all queued requests and fire the callbacks
multi_run()

# Parse each fetched response: read the raw body, extract any tables,
# and return NULL for pages that contained no table
AllOut <- lapply(out, function(x){
  webpage <- read_html(x$content)
  draft_table <- html_nodes(webpage, 'table')
  Tab <- html_table(draft_table)
  if(length(Tab) == 0){
    NULL
  } else {
    Tab
  }
})
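
A rough sketch of the cleanup step left to the reader above, assuming the draft table is the first table returned for each page and that every page shares the same columns:

# Keep the first table from each page, drop pages that returned no table,
# then stack everything into a single data frame
draft_list <- lapply(AllOut, function(tabs) if (is.null(tabs)) NULL else tabs[[1]])
draft_list <- Filter(Negate(is.null), draft_list)
finaldf <- do.call(rbind, draft_list)   # assumes identical columns on every page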
JackStat