
I have a series of 9 urls that I would like to scrape data from:

http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0 

The offset= parameter at the end of the link goes from 0 up to 900 (in steps of 100) as you page through to the last page. I would like to loop through each page, scrape its table, and then use rbind to stack the data frames on top of one another in sequence. I have been using rvest and would like to use lapply, since I am more comfortable with that than with for loops.

The question is similar to this one (Harvest (rvest) multiple HTML pages from a list of urls) but different because I would prefer not to have to copy all the links into one vector before running the program. I would like a general solution for how to loop over multiple pages, harvest the data, and create a data frame each time.

The following works for the first page:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0' 

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

But I would like to repeat this over all pages without having to paste the urls into a vector. I tried the following and it didn't work:

jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")

webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]

So there should be a data frame for each page and I imagine it would be easier to put them in a list and then use rbind to stack them.

Any help would be greatly appreciated!

jvalenti
  • do you manage to harvest even the first page? – HubertL Nov 17 '16 at 22:55
  • @HubertL yes just edited the question above. The first chunk of code produces one data frame – jvalenti Nov 17 '16 at 23:03
  • Here is another potential solution: http://stackoverflow.com/questions/39129125/how-to-scrape-all-pages-1-2-3-n-from-a-website-using-r-vest/39131227#39131227 – Dave2e Nov 17 '16 at 23:25
  • In the second version, `site` is a vector of URLs, so this is a dupe. – alistaire Nov 18 '16 at 00:03
  • I would try `download.file` with mode="a" and then read all the data from a single disk file. – IRTFM Nov 18 '16 at 01:05
  • @alistaire I don't think this is a dupe because the other question uses a pre-constructed vector of urls, while this one asks how to perform a similar task without already having all the urls in one place. – jvalenti Nov 21 '16 at 19:55
  • You've already shown how to construct the vector of URLs above; that's not the part that's not working. The part you need is to `lapply`/`purrr::map` across the vector `site`, of which the dupe shows an example, and of [which](http://stackoverflow.com/q/33771265/4497050) [there](http://stackoverflow.com/q/28823270/4497050) [are](http://stackoverflow.com/q/30586480/4497050) [many](http://stackoverflow.com/questions/40140133/scraping-tables-on-multiple-web-pages-with-rvest-in-r) [more](http://stackoverflow.com/q/36683510/4497050) [examples](http://stackoverflow.com/q/38114066/4497050). – alistaire Nov 21 '16 at 20:32

2 Answers


You are attempting to vectorize a method that cannot take multiple items in one call. Specifically, read_html() reads one page per call and expects a single URL (a scalar value), not a vector of them. Consider looping through the vector of site URLs with lapply, then binding all the data frames together:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    html_table(draft_table)[[1]]      # return the parsed table as a data frame
})

finaldf <- do.call(rbind, dfList)             # ASSUMING ALL DFs MAINTAIN SAME COLS
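
If the columns don't line up identically on every page, do.call(rbind, ...) will throw an error. A minimal alternative sketch, assuming the dplyr package is installed and that each page's table has unique column names, lets bind_rows() match columns by name and fill any gaps with NA:

library(dplyr)

# bind_rows() matches columns by name across the list of data frames
# and fills any columns missing from a given page with NA
finaldf <- bind_rows(dfList)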
Parfait

You can use curl to queue all of the requests and run them at once. Be nice to sites that may have small servers and don't blow them up with traffic. With this code you can use the lapply at the end to clean up the tables so you can stack them with do.call(rbind, AllOut), but I will leave that to you (a rough sketch of that step follows the code).

library(rvest)
library(stringr)
library(tidyr)

OffSet <- seq(0, 900, by = 100)

Sites <- paste0('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', OffSet)


library(curl)

out <- list()
# Callback run each time a request finishes successfully;
# it appends the response object to the list `out`
complete <- function(res){
  # cat("Request done! Status:", res$status, "\n")
  out <<- c(out, list(res))
}

# Queue one GET request per page; `done` and `fail` are the callbacks
# that run when each request completes or errors
for(i in seq_along(Sites)){
  curl_fetch_multi(
    Sites[i]
    , done = complete
    , fail = print
    , handle = new_handle(customrequest = "GET")
    )
}

# Perform all queued requests and fire the callbacks
multi_run()

# Parse each fetched response: read the raw body, extract any tables,
# and return NULL for pages that contained no table
AllOut <- lapply(out, function(x){
  webpage <- read_html(x$content)
  draft_table <- html_nodes(webpage, 'table')
  Tab <- html_table(draft_table)
  if(length(Tab) == 0){
    NULL
  } else {
    Tab
  }
})
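
A rough sketch of the cleanup step left to the reader above, assuming the draft table is the first table returned for each page and that every page shares the same columns:

# Keep the first table from each page, drop pages that returned no table,
# then stack everything into a single data frame
draft_list <- lapply(AllOut, function(tabs) if (is.null(tabs)) NULL else tabs[[1]])
draft_list <- Filter(Negate(is.null), draft_list)
finaldf <- do.call(rbind, draft_list)   # assumes identical columns on every page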
JackStat