1

I have reviewed several answers to similar questions on SO, but none of them seem to work for me.

(loop across multiple urls in r with rvest)

(Harvest (rvest) multiple HTML pages from a list of urls)

I have a list of URLs and I wish to grab the table from each and append it to a master dataframe.

## get all urls into one list
page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
  urls[[i]] <- url
}


### loop over the urls and get the table from each page
table<- data.frame()
for (j in urls) {
  tbl<- urls[j] %>% 
    read_html() %>% 
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}

The first section works as expected and builds the list of URLs I want to scrape. The second loop, however, fails with the following error:

 Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

Any suggestions on how to correct this error and get the 3 tables combined into a single DF? I appreciate any tips or pointers.

cowboy
  • Have you tried assigning `j <- 1` outside the for loop and then `j <- j+1` inside it, after the `table[[j]] <- tbl` line? – On_an_island Dec 08 '18 at 03:48

2 Answers

3

Try this:

library(tidyverse)
library(rvest)

page<- (0:2)
urls <- list()
for (i in 1:length(page)) {
  url<- paste0("https://www.mlssoccer.com/stats/season?page=",page[i])
  urls[[i]] <- url
}

### loop over the urls and get the table from each page
tbl <- list()
j <- 1                        # not required: the for loop below sets j itself
for (j in seq_along(urls)) {
  tbl[[j]] <- urls[[j]] %>%   # store the table scraped from the j-th url as element j of the tbl list
    read_html() %>% 
    html_node("table") %>%
    html_table()
  j <- j+1                    # has no effect on the iteration: for() resets j from seq_along(urls) on every pass
}

#convert list to data frame
tbl <- do.call(rbind, tbl)

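The table[[j]] <- tbl line at the end of the for loop in your original code is not necessary, because each scraped table is already stored as an element of the tbl list by tbl[[j]] <- urls[[j]] %>% ... The key changes are iterating over integer positions with seq_along(urls) and extracting each URL with double brackets.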
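The same pattern can be sketched more compactly, since paste0() is vectorised and lapply() collects the results in a list without manual indexing. This is only an illustration, not part of the original answer: get_table, scrape_all and fake_fetch are hypothetical names, and fake_fetch returns made-up data so the sketch runs without network access.

```r
# Build all page URLs in one call: paste0() is vectorised over `page`.
page <- 0:2
urls <- paste0("https://www.mlssoccer.com/stats/season?page=", page)

# Hypothetical helper: fetch the first <table> on a page.
# rvest is only needed when this function is actually called.
get_table <- function(url) {
  rvest::html_table(rvest::html_node(rvest::read_html(url), "table"))
}

# Apply a fetcher to every URL and stack the results into one data frame.
# `fetch` is a parameter so the function can be tried offline with a stand-in.
scrape_all <- function(urls, fetch = get_table) {
  tables <- lapply(urls, fetch)
  do.call(rbind, tables)
}

# Offline stand-in returning a one-row data frame per URL (made-up data).
fake_fetch <- function(url) data.frame(url = url, rows = 1L)
demo <- scrape_all(urls, fetch = fake_fetch)

# With network access and rvest installed, the real call would be:
# all_tables <- scrape_all(urls)
```

Note that do.call(rbind, ...) assumes every page returns a table with the same columns.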

On_an_island
  • Thanks @on_an_island. Exactly the output I was looking for. – cowboy Dec 08 '18 at 17:47
  • I know this is not the best place to ask, but how could I add a column to the table showing which "j" was used? (I'm adapting the solution.) – Adilson V Casula Jan 06 '20 at 22:07
  • @AdilsonVCasula just add `tbl[[j]]$j <- j` directly before `j <- j+1`. This will append a column with the `j`th value. Or you could mutate the column by replacing `html_table()` with `html_table() %>% mutate(j = j)`. – On_an_island Jan 07 '20 at 22:21
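A minimal sketch of the tagging idea from the comment above, using two stand-in data frames with made-up values in place of the scraped tables so that it runs offline:

```r
# Stand-in tables (made-up data) in place of the scraped ones.
tbl <- list(
  data.frame(Player = c("A", "B"), G = c(3, 1)),
  data.frame(Player = "C", G = 2)
)

# Tag every table with the index of the page it came from.
for (j in seq_along(tbl)) {
  tbl[[j]]$j <- j
}

# Stack the tagged tables; the j column records the source page of each row.
combined <- do.call(rbind, tbl)
```

After binding, each row of combined still carries the index of the list element it came from.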
2

Here is your problem:

for (j in urls) {
  tbl<- urls[j] %>% 

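When you write for (j in urls), j takes the value of each element of urls in turn, so the j values are not integers; they are the URLs themselves. In addition, single brackets (urls[j]) always return a list rather than the element it contains, which is why read_html() complains about an object of class "list". Use an integer index with double brackets instead.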
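The difference can be checked without any scraping. The sketch below uses placeholder URLs (example.com, not the real site) and only inspects classes:

```r
urls <- list("https://example.com/a", "https://example.com/b")  # placeholder URLs

# Iterating over the list itself yields its elements (character strings).
seen <- character()
for (j in urls) seen <- c(seen, class(j))

# Single brackets keep the list wrapper, whatever the index is.
by_string   <- class(urls[urls[[1]]])  # indexing by a URL string, as in the question
by_position <- class(urls[1])          # integer index, single brackets
element     <- class(urls[[1]])        # integer index, double brackets
```

Only the double-bracket form yields a character string that read_html() can treat as a URL.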

Try:

for (j in 1:length(urls)) {
  tbl<- urls[[j]] %>% 
    read_html() %>% 
    html_node("table") %>%
    html_table()
  table[[j]] <- tbl
}

You can also use seq_along():

for (j in seq_along(urls))
R. Schifini
  • That seems to fix the original issue; however, it creates a new error which I don't understand: "Error in [[<-.data.frame(*tmp*, j, value = list(list(Player = c("Diego Rubio", : replacement has 1 row, data has 0" – Brad_J 25 mins ago – cowboy Dec 08 '18 at 17:06
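The "replacement has 1 row, data has 0" error in the comment above most likely arises because table was created with data.frame(), which has zero rows, so assigning a scraped table into it with [[<- fails. A hedged sketch of the problem and one possible fix, using a stand-in one-row table (the G value is made up) instead of real scraped data:

```r
# Stand-in for one scraped page (made-up values).
tbl <- data.frame(Player = "Diego Rubio", G = 1)

# Reproduce the failure: a zero-row data frame cannot accept a 1-row value.
table <- data.frame()
outcome <- tryCatch(
  { table[[1]] <- tbl; "ok" },
  error = function(e) conditionMessage(e)
)

# One possible fix: collect the tables in a list, then bind them once.
tables <- list()
for (j in 1:3) {
  tables[[j]] <- tbl   # in the real loop this would be the table scraped from urls[[j]]
}
combined <- do.call(rbind, tables)
```

Collecting into a list and binding once also avoids growing a data frame inside the loop.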