You could avoid the expensive overhead of a browser and use httr2 instead. The page fetches reviews in batches via a query-string GET request. For each batch, the cursor parameters startCursor and endCursor can be picked up from the previous response, which also carries a hasNextPage flag that tells you when to stop requesting further batches. For the initial request, the title id needs to be scraped from the reviews page, and both cursors can be set to "".
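For orientation, each paginated response, once parsed by resp_body_json(), contains roughly the following fields. This shape is inferred from what the code below reads; the real payload has more fields:

# Inferred shape of one parsed response (an assumption; only the fields
# the loop below uses are shown):
page <- list(
  reviews  = list(),          # the current batch of review records
  pageInfo = list(
    hasNextPage = TRUE,       # FALSE once the final batch is reached
    startCursor = "<opaque>", # echo these back in the next request
    endCursor   = "<opaque>"
  )
)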
After collecting all reviews (in a list, in my case), I apply a custom function to extract a few items of possible interest from each review and generate the final dataframe.
Acknowledgments: I took the idea of using repeat from @flodal here
library(tidyverse)
library(httr2)
get_reviews <- function(results, n) {
  # Scrape the reviews page once to pick up the internal titleId that the
  # pagination endpoint expects.
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]

  # Both cursors start empty for the first batch.
  start_cursor <- ""
  end_cursor <- ""

  repeat {
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()

    results[[n]] <- r$reviews

    # Stop once the server reports there is no further page; otherwise carry
    # the cursors forward into the next request.
    if (!r$pageInfo$hasNextPage) break

    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}
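If you expect to pull many batches, it may be worth hardening the paginated request against rate limits and transient failures. httr2 ships req_throttle() and req_retry() for exactly this; a sketch (the numbers are illustrative, not tuned, and title_id is assumed to be in scope as above):

# Hypothetical hardened pipeline for the same endpoint:
hardened <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
  req_headers("user-agent" = "mozilla/5.0") %>%
  req_throttle(rate = 2) %>%   # cap at roughly 2 requests per second
  req_retry(max_tries = 3)     # retry transient failures up to 3 times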
n <- 1
results <- list()
data <- get_reviews(results, n)

# Flatten the list of batches into one list of reviews, then pull out the
# fields of interest; scoreOri is sometimes NULL, so substitute NA.
df <- purrr::map_dfr(data %>% unlist(recursive = FALSE), ~
  data.frame(
    date = .x$creationDate,
    reviewer = .x$publication$name,
    url = .x$reviewUrl,
    quote = .x$quote,
    score = if (is.null(.x$scoreOri)) NA_character_ else .x$scoreOri,
    sentiment = .x$scoreSentiment
  ))
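If fields other than scoreOri can also come back NULL, a slightly more defensive variant of the extraction uses the %||% null-default operator (re-exported by purrr, so already available via the tidyverse). A sketch under the same field names, not part of the original answer:

# Defensive variant: any NULL field becomes NA instead of breaking data.frame().
df <- purrr::map_dfr(unlist(data, recursive = FALSE), ~
  data.frame(
    date      = .x$creationDate %||% NA_character_,
    reviewer  = .x$publication$name %||% NA_character_,
    url       = .x$reviewUrl %||% NA_character_,
    quote     = .x$quote %||% NA_character_,
    score     = .x$scoreOri %||% NA_character_,
    sentiment = .x$scoreSentiment %||% NA_character_
  ))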