You could avoid the expensive overhead of a browser and use httr2 instead. The page fetches reviews in batches via a query-string GET request. For each batch, the cursor parameters startCursor and endCursor can be picked up from the previous response, which also carries a hasNextPage flag that tells you when to stop requesting further batches. For the initial request, the title id needs to be scraped from the reviews page, and both cursors can be set to "".
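For orientation, each paginated response, once parsed by resp_body_json(), contains roughly the following fields. This shape is inferred from what the code below reads; the real payload has more fields:

# Inferred shape of one parsed response (an assumption; only the fields
# the loop below uses are shown):
page <- list(
  reviews  = list(),          # the current batch of review records
  pageInfo = list(
    hasNextPage = TRUE,       # FALSE once the final batch is reached
    startCursor = "<opaque>", # echo these back in the next request
    endCursor   = "<opaque>"
  )
)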
After collecting all reviews (in a list, in my case), I apply a custom function to extract a few items of possible interest from each review and generate the final dataframe.
Acknowledgments: I took the idea of using repeat from @flodal here
library(tidyverse)
library(httr2)
get_reviews <- function(results, n) {
  # Scrape the reviews page once to pick up the internal titleId that the
  # pagination endpoint expects.
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]

  # Both cursors start empty for the first batch.
  start_cursor <- ""
  end_cursor <- ""

  repeat {
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()

    results[[n]] <- r$reviews

    # Stop once the server reports there is no further page; otherwise carry
    # the cursors forward into the next request.
    if (!r$pageInfo$hasNextPage) break

    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}
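If you expect to pull many batches, it may be worth hardening the paginated request against rate limits and transient failures. httr2 ships req_throttle() and req_retry() for exactly this; a sketch (the numbers are illustrative, not tuned, and title_id is assumed to be in scope as above):

# Hypothetical hardened pipeline for the same endpoint:
hardened <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
  req_headers("user-agent" = "mozilla/5.0") %>%
  req_throttle(rate = 2) %>%   # cap at roughly 2 requests per second
  req_retry(max_tries = 3)     # retry transient failures up to 3 times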
n <- 1
results <- list()
data <- get_reviews(results, n)

# Flatten the list of batches into one list of reviews, then pull out the
# fields of interest; scoreOri is sometimes NULL, so substitute NA.
df <- purrr::map_dfr(data %>% unlist(recursive = FALSE), ~
  data.frame(
    date = .x$creationDate,
    reviewer = .x$publication$name,
    url = .x$reviewUrl,
    quote = .x$quote,
    score = if (is.null(.x$scoreOri)) NA_character_ else .x$scoreOri,
    sentiment = .x$scoreSentiment
  ))
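If fields other than scoreOri can also come back NULL, a slightly more defensive variant of the extraction uses the %||% null-default operator (re-exported by purrr, so already available via the tidyverse). A sketch under the same field names, not part of the original answer:

# Defensive variant: any NULL field becomes NA instead of breaking data.frame().
df <- purrr::map_dfr(unlist(data, recursive = FALSE), ~
  data.frame(
    date      = .x$creationDate %||% NA_character_,
    reviewer  = .x$publication$name %||% NA_character_,
    url       = .x$reviewUrl %||% NA_character_,
    quote     = .x$quote %||% NA_character_,
    score     = .x$scoreOri %||% NA_character_,
    sentiment = .x$scoreSentiment %||% NA_character_
  ))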