
I'm trying to extract the reviews of a product on Amazon. The reviews are spread across pages that share the same URL apart from the page number. Running the script below manually works, but for every page I have to change the page number in the URL and the name of the resulting data frame by hand, then run it again.

Since that is tedious for almost 70 pages, I tried to write a for loop to do the same thing. My attempt is below the manual version, but it gives me an error.

MANUAL 
```
library(tidyr)
library(rvest)

url_reviews <- "https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=16"
doc <- read_html(url_reviews) # Assign results to `doc`

# Review Title
doc %>% 
  html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
  html_text() -> review_title

# Review Text
doc %>% 
  html_nodes("[class='a-size-base review-text review-text-content']") %>%
  html_text() -> review_text

# Number of stars in review
doc %>%
  html_nodes("[data-hook='review-star-rating']") %>%
  html_text() -> review_star

# Combine the results into a data frame
page_16<-data.frame(review_title,
                review_text,
                review_star,
                page =16) 
```

FOR LOOP

```
range <- 12:82
url_max <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", range ,"?ie=UTF8&reviewerType=all_reviews&pageNumber=",range)

for (i in 1:length(url_max)) {

  doc <- read_html(url_max[i]) # Assign results to `doc`

  # Review Title
  doc %>% 
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>% 
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  paste0("page_", range) <- tibble(review_title,
                                   review_text,
                                   review_star,
                                   page = paste0("a", i))
}
```
Andrea

2 Answers


Here's another alternative that defines a function and then uses lapply() to sequentially run the function.

This approach may be helpful if you need to repeat the scrape for different products. The function accepts two parameters: the first, i, is the page number, and the second, product, is the ID of the product for which you are gathering reviews (the ASIN, as it appears in the URL). The function constructs the URL by interpolating the product ID and the page number into a template.

While I used lapply(), the function below could also be inserted in the map_df() function in Ronak's answer (and would likely be faster than binding rows).

```
library(dplyr)
library(rvest)
library(stringr)

retrieve_reviews <- function(i, product) {

    urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"
    url <- str_interp(urlstr, list(product = product, i = i))
    doc <- read_html(url) # Assign results to `doc`
    
    # Review Title
    doc %>% 
        html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
        html_text() -> review_title
    
    # Review Text
    doc %>% 
        html_nodes("[class='a-size-base review-text review-text-content']") %>%
        html_text() -> review_text
    
    # Number of stars in review
    doc %>%
        html_nodes("[data-hook='review-star-rating']") %>%
        html_text() -> review_star
    
    return(tibble(
        title = review_title,
        text = review_text,
        star = review_star,
        page = paste0("a", i)
    ))
}


range <- 12:82
product <- "B07WTHVQZH"
reviews <- lapply(range, retrieve_reviews, product) %>%
    bind_rows()
```
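As a rough sketch of the map_df() variant mentioned above: the snippet below uses purrr::map_dfr() (the same function as map_df()) to call a function once per page and row-bind the results. To keep it runnable without network access, `retrieve_stub()` is a hypothetical stand-in that takes the same arguments as retrieve_reviews() but returns made-up rows; the extra `product` column is also only for illustration.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for retrieve_reviews(): same arguments, no network access.
# The returned values are placeholders, not real review data.
retrieve_stub <- function(i, product) {
  tibble(
    title   = paste("Title", 1:2),
    text    = paste("Review text from page", i),
    star    = "placeholder rating",
    page    = paste0("a", i),
    product = product
  )
}

pages <- 12:14
product <- "B07WTHVQZH"

# map_dfr() calls the function once per page and row-binds the results.
# Extra arguments (here `product`) are passed through to the function.
reviews <- map_dfr(pages, retrieve_stub, product = product)
print(reviews)
```

Swapping `retrieve_stub` for the real retrieve_reviews() should give the same shape of result, assuming every page returns columns of matching length.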
mikebader
  • Thank you for your answer @mikebader, the urlstr presents the ${product} before the actual ASIN (B07WTHVQZH), given the structure of amazon urls, the correct form could be this? urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}" – Andrea Sep 15 '21 at 19:10
  • @Andrea, yes you are correct -- I didn't see that structure of the URL. I have updated the answer. – mikebader Sep 15 '21 at 19:13
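One way to check the URL template discussed in these comments, without making any requests, is to expand it for a few pages and inspect the result. This is only a sketch: `build_url()` is a hypothetical helper name, and the template is the one quoted above.

```r
library(stringr)

urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"

# Hypothetical helper: fill the template for one page of one product.
build_url <- function(i, product) {
  str_interp(urlstr, list(product = product, i = i))
}

# Expand the template for three pages and print the URLs for inspection.
urls <- vapply(12:14, build_url, character(1), product = "B07WTHVQZH")
print(urls)
```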

You can use map_df() from purrr to loop over the page numbers and combine the results into a single data frame.

```
library(rvest)

page_numbers <- 12:82

purrr::map_df(page_numbers, ~{
  # Insert the current page number (.x) in both places where it appears in the URL
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", .x, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", .x)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review Title
  doc %>% 
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>% 
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a data frame for this page
  data.frame(review_title,
             review_text,
             review_star,
             page = .x)
}) -> result

result
```
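If you would rather keep a plain for loop, a minimal sketch is to store each page in a named list instead of trying to create a variable with paste0() on the left-hand side of `<-`, which R does not accept as an assignment target. Here `scrape_page()` is a hypothetical stand-in for the scraping step so that the loop runs without network access, and its values are placeholders.

```r
# Hypothetical stand-in for the scraping step (no network access, placeholder values).
scrape_page <- function(page) {
  data.frame(
    review_title = paste("Title from page", page),
    review_star  = "placeholder rating",
    page         = page
  )
}

page_numbers <- 12:14

# Pre-allocate a list with one named slot per page.
pages <- vector("list", length(page_numbers))
names(pages) <- paste0("page_", page_numbers)

for (i in seq_along(page_numbers)) {
  # Assign into a list slot rather than into a variable name built with paste0().
  pages[[i]] <- scrape_page(page_numbers[i])
}

# Individual pages stay accessible by name, e.g. pages$page_13,
# and can be combined into one data frame at the end.
result_loop <- do.call(rbind, pages)
print(result_loop)
```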
Ronak Shah
  • Thank you! your correction worked, I just corrected a small typo in the url_reviews (used the .x ) to correct the typo of the page 16 that was left. May I ask you about the ~ used in the opening part? – Andrea Sep 14 '21 at 15:48
  • `~` is formula style syntax used as an alternative for anonymous function. These links have more explanation https://stackoverflow.com/questions/56621051/in-map-when-is-it-necessary-to-use-a-tilde-and-a-period-and and https://stackoverflow.com/questions/44834446/what-is-meaning-of-first-tilde-in-purrrmap – Ronak Shah Sep 15 '21 at 23:06
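To make the `~` explanation concrete, here is a small sketch comparing the formula shorthand with an ordinary anonymous function. The page numbers and the pasted string are only illustrative.

```r
library(purrr)

page_numbers <- 12:14

# Formula shorthand: `.x` refers to the current element.
with_formula <- map_chr(page_numbers, ~ paste0("pageNumber=", .x))

# Equivalent anonymous function written out in full.
with_function <- map_chr(page_numbers, function(x) paste0("pageNumber=", x))

# Both calls produce the same character vector.
print(with_formula)
print(identical(with_formula, with_function))
```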