
I'm a total rookie at web scraping, so apologies for the basic question; I have searched around but struggled to apply previous answers from here. I am trying to scrape multiple related URLs on fbref.com (part of Sports Reference), but I think I am not using lapply properly. I can successfully pull one URL, just not all of them at once.

Here is the gist of what I'm trying to do:

library("rvest")
library("tidyverse")

year1 <- paste0(2006:2021)
year2 <- paste0(2007:2022)

urls <- paste0("https://fbref.com/en/comps/Big5/", year1, "-", year2, "/stats/players/", year1, "-", year2, "-Big-5-European-Leagues-Stats")

table <- read_html(urls) |> 
  html_nodes("table") |> 
  html_table()

I think I just need to wrap that last section in lapply, but I am struggling to get the syntax right. When I run the last section on ONE URL by pasting it in directly, as below, I get the output I want. I simply want this for all seasons from 2006-07 through 2021-22, in one csv file.

> url <- "https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"
> table <- read_html(url) |> 
+     html_nodes("table") |> 
+     html_table()
> write.csv(table, file = "fbrefinitial.csv")

From there, I think I just need to use bind_rows along with either year1 or year2 to add a column identifying the season, as I would like to get this all in one sheet of one csv file. (What's the right way to write that command?)

This is most similar to this post, but my attempts to apply that logic in different ways are not working.

Thank you for your help!

Bob H

1 Answer


read_html() takes a single URL rather than a vector, so apply it to each element of urls with lapply:

lapply(urls, function(url) {
  read_html(url) |>
    html_nodes("table") |>
    html_table()
})
#> [[1]]
#> [[1]][[1]]
#> # A tibble: 2,687 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Dani Aba~ es E~ FW,MF Celt~ es L~ 18    1987  1       0       13      0.1    
#>  3 2     Jacques ~ fr F~ DF    Nice  fr L~ 28    1978  30      28      2,492   27.7   
#>  4 3     Christia~ it I~ GK    Tori~ it S~ 29    1977  36      36      3,235   35.9   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 33    1972  36      36      3,215   35.7   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 25    1981  29      15      1,432   15.9   
#>  7 6     Nadjim A~ km C~ MF    Sedan fr L~ 22    1984  17      11      1,136   12.6   
#>  8 7     Nelson A~ uy U~ MF    Atal~ it S~ 33    1973  5       2       121     1.3    
#>  9 8     Mathias ~ de G~ DF    Hamb~ de B~ 25    1981  8       4       416     4.6    
#> 10 9     Éric Abi~ fr F~ DF    Lyon  fr L~ 26    1979  33      31      2,750   30.6   
#> # ... with 2,677 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> # A tibble: 2,770 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Nice  fr L~ 29    1978  10      4       434     4.8    
#>  3 2     Jacques ~ fr F~ DF    Nürn~ de B~ 29    1978  10      9       820     9.1    
#>  4 3     Ignazio ~ it I~ DF,MF Empo~ it S~ 20    1986  24      9       1,167   13.0   
#>  5 4     Christia~ it I~ GK    Atlé~ es L~ 30    1977  21      20      1,804   20.0   
#>  6 5     Pato Abb~ ar A~ GK    Geta~ es L~ 34    1972  34      34      3,046   33.8   
#>  7 6     Yacine A~ ma M~ MF    Stra~ fr L~ 26    1981  23      17      1,549   17.2   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 25    1982  26      24      2,230   24.8   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 27    1979  30      28      2,523   28.0   
#> 10 9     Ahmed Ab~ eg E~ DF,MF Stra~ fr L~ 26    1981  2       1       91      1.0    
#> # ... with 2,760 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> # A tibble: 2,796 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Vale~ fr L~ 30    1978  18      14      1,252   13.9   
#>  3 2     Ignazio ~ it I~ DF,MF Tori~ it S~ 21    1986  25      21      1,913   21.3   
#>  4 3     Christia~ it I~ GK    Milan it S~ 31    1977  28      28      2,441   27.1   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 35    1972  13      13      1,083   12.0   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 27    1981  10      2       388     4.3    
#>  7 6     Djamel A~ dz A~ MF    Nant~ fr L~ 22    1986  22      12      1,139   12.7   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 26    1982  25      20      1,788   19.9   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 28    1979  25      25      2,116   23.5   
#> 10 9     Fabrice ~ fr F~ MF    Lori~ fr L~ 29    1979  35      35      3,060   34.0   
#> # ... with 2,786 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
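
The printed output shows that the real column names (Rk, Player, Pos, and so on) arrive as the first data row, because the table on the page has a two-level header. Below is a minimal sketch of one way to tidy a single scraped table before combining seasons. It uses base R only. The function name `clean_fbref_table` and the toy `raw` data are hypothetical, and the step that drops repeated header rows assumes the site repeats the header inside the table body, which is worth checking against the actual pages.

```r
# Hypothetical helper: promote the first data row to column names and
# drop any later rows that merely repeat that header.
clean_fbref_table <- function(tab) {
  tab <- as.data.frame(tab, stringsAsFactors = FALSE)
  # The first row holds the real header ("Rk", "Player", ...)
  header <- vapply(tab, function(col) as.character(col[1]),
                   character(1), USE.NAMES = FALSE)
  # Names such as "Gls" can occur twice (totals and per 90), so make them unique
  names(tab) <- make.unique(header)
  tab <- tab[-1, , drop = FALSE]
  # Assumption: repeated header rows show the header text in the first column
  tab <- tab[!(tab[[1]] %in% header[1]), , drop = FALSE]
  rownames(tab) <- NULL
  tab
}

# Toy stand-in (made-up values) mimicking the shape of the scraped table
raw <- data.frame(
  c("Rk", "1", "2", "Rk", "3"),
  c("Player", "A", "B", "Player", "C"),
  c("Gls", "4", "0", "Gls", "7"),
  c("Gls", "0.41", "0.00", "Gls", "0.52"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Performance", "Per 90 Minutes")

cleaned <- clean_fbref_table(raw)
print(cleaned)
```

On the scraped result, a helper like this would be applied to each season's table inside the lapply call. Every column is still character at that point, so numeric columns would need converting afterwards.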

Allan Cameron
  • Thanks! Still not quite getting it over the line, I think because I need to bind them correctly. I ran `fbref_stats <- lapply(urls, function(url) { read_html(url) |> html_nodes("table") |> html_table() })` followed by `write.csv(fbref_stats, file = "fbreftest4.csv")`, which fails with: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2687, 2770, 2796, 2813, 2818, 2844, 2867, 2861, 2799, 2882, 2840, 2763, 2842, 2935, 3038. I have been trying `bind_rows(fbref_stats, .id = "year2")` and variants of that but can't get it quite right – Bob H Mar 11 '23 at 20:43
  • @BobH, assuming you are interested in 1 table per page: because `html_nodes("table")` returns a list (even if there's just one element), `html_table()` also returns a list, so your `fbref_stats` list will include one extra level of nesting. If you change `html_nodes` to `html_node`, or use some other means to get just a single tibble per lapply iteration, you should be able to use `bind_rows(fbref_stats)` or `do.call(rbind, fbref_stats)`. – margusl Mar 12 '23 at 09:39
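
The nesting described in the comment above can be seen without touching fbref.com. The sketch below builds two tiny in-memory pages with rvest's `minimal_html()` and compares the plural and singular selectors. It uses `html_elements()` and `html_element()`, the current rvest names for `html_nodes()` and `html_node()`. The page contents, the season labels used as list names, and the temporary output file are all made up for illustration. On the real pages the column names are blank or repeated, as the printed output in the answer shows, so they may need to be made unique before `bind_rows()` will accept them.

```r
library(rvest)
library(dplyr)

# Two stand-in pages (hypothetical data), one table each
pages <- list(
  "2006-2007" = minimal_html(
    "<table><tr><th>Player</th><th>Gls</th></tr>
     <tr><td>A</td><td>1</td></tr></table>"
  ),
  "2007-2008" = minimal_html(
    "<table><tr><th>Player</th><th>Gls</th></tr>
     <tr><td>B</td><td>2</td></tr>
     <tr><td>C</td><td>3</td></tr></table>"
  )
)

# Plural selector: each page yields a LIST containing one tibble,
# so the overall result is a list of lists
nested <- lapply(pages, function(p) {
  p |> html_elements("table") |> html_table()
})

# Singular selector: each page yields the tibble itself
flat <- lapply(pages, function(p) {
  p |> html_element("table") |> html_table()
})

# The list names become the values of the new "season" column
combined <- bind_rows(flat, .id = "season")
print(combined)

# Written to a temporary file here; any path would do
out_file <- tempfile(fileext = ".csv")
write.csv(combined, out_file, row.names = FALSE)
```

With the real data, naming the `urls` vector by season before the lapply call would give the same `season` column. A short `Sys.sleep()` between requests may also be sensible in case the site limits request rates.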