I'm trying to scrape game reviews from multiple pages of the same gaming website.

I adapted the code from one of the answers to [R web scraping across multiple pages](https://stackoverflow.com/questions/36683510/r-web-scraping-across-multiple-pages):

    library(tidyverse)
    library(rvest)

    url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=0"

    map_df(1:17, function(i) {

      cat(".")

      pg <- read_html(sprintf(url_base, i))

      data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
                 MetaRating = as.numeric(html_text(html_nodes(pg, "#main .positive"))),
                 UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
                 stringsAsFactors = FALSE)

    }) -> ps4games_metacritic

The result is that the first page is scraped 17 times, instead of each of the 17 pages on the website being scraped once.

ha-pu
Sam
  • 1
    If you look at the answer you linked, you see that pagenumber was replaced by `%d`. So in your case you scrape page number 0, 17 times. Try `url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"` instead. – Tonio Liebrand Oct 10 '19 at 09:30
  • 2
    Possible duplicate of [R web scraping across multiple pages](https://stackoverflow.com/questions/36683510/r-web-scraping-across-multiple-pages) – Tonio Liebrand Oct 10 '19 at 09:30
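To illustrate the point made in the comment above, here is a minimal sketch in base R that only builds the URLs and downloads nothing. The variable names `url_fixed`, `url_template`, `urls_fixed`, and `urls_template` are made up for this example. It assumes that `sprintf()` returns a format string unchanged when it contains no placeholder, which is why the original code requests the same page every time.

```r
# Base R only; no pages are requested here.
url_fixed    <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=0"
url_template <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"

# No placeholder in the string: the extra argument is not used, so every
# call yields the same URL (recent R versions warn about the unused argument).
urls_fixed <- suppressWarnings(
  vapply(1:3, function(i) sprintf(url_fixed, i), character(1))
)

# With %d, each page index is substituted into the URL.
# sprintf() is vectorised, so 0:2 gives three URLs in one call.
urls_template <- sprintf(url_template, 0:2)

length(unique(urls_fixed))     # 1
length(unique(urls_template))  # 3
```

Printing `urls_template` before running the scraper is a cheap way to confirm that the page numbers end up where you expect.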

1 Answer

I made three changes to your code:

  1. Since the site's page numbering starts at 0, `map_df(1:17, ...)` should be `map_df(0:16, ...)`.
  2. As proposed by BigDataScientist, `url_base` needs a `%d` placeholder for the page number: `url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"`
  3. If you use `"#main .positive"`, you will get an error while scraping the 7th page, because games without positive scores start there. Unless you only want to scrape games with positive evaluations (which would require slightly different code), use `"#main .game"` instead.

    library(tidyverse)
    library(rvest)

    # %d is the placeholder that sprintf() fills with the page index
    url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"

    # The site numbers its pages from 0 to 16
    map_df(0:16, function(i) {

      cat(".")  # progress indicator, one dot per page
      pg <- read_html(sprintf(url_base, i))

      # One row per game; map_df() row-binds the per-page data frames
      data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
                 MetaRating = as.numeric(html_text(html_nodes(pg, "#main .game"))),
                 UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
                 stringsAsFactors = FALSE)

    }) -> ps4games_metacritic
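
Regarding the third change, the sketch below shows the kind of failure I would expect when a selector matches fewer elements than there are game titles. It uses made-up vectors (`titles`, `scores_narrow`, `scores_all`) rather than scraped data, and it is an assumption that the page 7 error comes from exactly this column-length mismatch in `data.frame()`.

```r
# Hypothetical stand-ins for the scraped columns: three titles, but a
# narrower selector that matched only two scores.
titles        <- c("Game A", "Game B", "Game C")
scores_narrow <- c(97, 95)       # e.g. only the scores marked as positive
scores_all    <- c(97, 95, 48)   # one score per title

# Columns of incompatible length make data.frame() stop with an error;
# tryCatch() captures the message instead of aborting the script.
result <- tryCatch(
  data.frame(Name = titles, MetaRating = scores_narrow, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(result)

# With one score per title the data frame is built as expected.
ok <- data.frame(Name = titles, MetaRating = scores_all, stringsAsFactors = FALSE)
nrow(ok)
```

Note that `data.frame()` silently recycles a shorter column when its length divides the longer one evenly, so it is worth comparing the lengths of the scraped vectors yourself before combining them.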
ha-pu