Filling empty values from web scraping with a string (rvest)

Question

I am trying to scrape user reviews from a website. Some of the reviews have no body text, so I end up with vectors of different lengths and get the "arguments imply differing number of rows: 20, 19" error (20 is correct) when I try to combine the scraped datetime, rating, and review results into a data frame.

I have looked at the solution here, which uses !nzchar to perform a replacement if the length of an HTML node is zero. This seems like a good fit for my case, but I can't get the code to insert a value into the vector to make the length correct. My code to scrape the node that contains an empty value is:

library(rvest)
library(tidyverse)
library(stringr)

url <- "http://www.trustpilot.com/review/www.amazon.com?page=2"
working_page <- read_html(url)

working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), NA) %>%
  str_trim() %>%
  unlist()

length(working_reviews)

[1] 19

This returns a vector of 19 values; my expected output is a vector of 20 values, with NA in the positions where there is no review body. On this particular page, the 17th review contains no body text.

Desired result:

working_reviews[1]

[1] "I placed an order w/Amazon and selected the 18 payment plan. Amazon charged the entire amount to my card. Called them and got no where. I was told it was the banks fault and I had to take it up with them.Buyer be ware!!!"

working_reviews[17]

[17] "NA"

I have also tried using the following line to "force" insert a string into the empty review:

working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), "No review") %>%
  str_trim() %>%
  unlist()

This produces the same result with a length of 19 and does not include an element containing "No review".

I also tried inverting the nzchar code as a test, removing the '!' and got back a 19-element vector with "NA" for every element.
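
A likely explanation for all three results, sketched on a toy page rather than the live site: a review without body text probably has no body element at all, so the selector never returns an empty string for replace() to act on. The HTML, the class names .card and .body, and the object names below are hypothetical stand-ins, not Trustpilot's markup.

```r
library(rvest)

# Hypothetical stand-in for the review page: the second card has no body element.
toy_page <- minimal_html('
  <div class="card"><p class="body">First review</p></div>
  <div class="card"></div>
  <div class="card"><p class="body">Third review</p></div>
')

# Selecting the bodies directly from the page only returns elements that exist.
bodies <- toy_page %>%
  html_elements(".body") %>%
  html_text(trim = TRUE)

length(bodies)        # 2, not 3: the missing body never produced an element
any(!nzchar(bodies))  # FALSE: there is no empty string for replace() to target
```

This would also be consistent with the inverted test: every scraped string is non-empty, so nzchar() is TRUE everywhere and every element gets replaced.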

Without the HTML to actually test with it's not easy to see what's going on here. It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Aug 01 '22 at 19:49
@MrFlick Thank you, I have made the edits to include the URL and desired outputs. — supercarp, Aug 01 '22 at 20:04
See this question and answer: https://stackoverflow.com/questions/65031696/rvest-scraping-google-news-with-different-number-of-rows/65038966#65038966 — Dave2e, Aug 01 '22 at 21:35

score 2 · Accepted Answer · answered Aug 01 '22 at 20:24

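This collects everything neatly into a tibble and returns NA where a review body is missing. It selects each review card first and then calls html_element() (singular) within each card; unlike html_elements(), html_element() returns exactly one result per input node, with NA where nothing matches, so every column has 20 rows.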

library(tidyverse)
library(rvest)

page <-
  "https://www.trustpilot.com/review/www.amazon.com?page=2" %>%
  read_html()

tibble(
  name = page %>%  
    html_elements(".styles_consumerName__dP8Um") %>% 
    html_text2(),
  rating = page %>% 
    html_elements(".styles_reviewHeader__iU9Px img") %>% 
    html_attr("alt") %>% 
    parse_number(),
  title = page %>% 
    html_elements(".link_notUnderlined__szqki.typography_color-inherit__TlgPO") %>% 
    html_text2(),
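  review = page %>%
    # One node per review card, so the result always has one entry per review
    html_elements(".styles_reviewCard__hcAvl") %>%
    # Within each card, html_element() yields a missing node if there is no body
    map(. %>%
          html_element(".typography_body__9UBeQ") %>%
          html_text2()) %>%
    unlist()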
)

# A tibble: 20 x 4
   name               rating title                               review
   <chr>               <dbl> <chr>                               <chr> 
 1 Octo Cavazos            1 I placed an order w/Amazon and sel~ "I pl~
 2 Jeffrey Hayes           1 Don't waste your time,energy or mo~ "Don'~
 3 Andy Here               1 Over the pandemic                   "Over~
 4 Lorna Mills             1 Customer service                    "I or~
 5 Daniel Sthamer          1 Prime delivery isn't worth it anym~ "Amaz~
 6 Carolyn                 2 Amzon delivery is not worth the pr~ "Amaz~
 7 BruceW                  5 “We apologize but Amazon has notic~ "“We ~
 8 Matthew Smego           1 Aweful                              "Almo~
 9 goku                    1 Prime membership traps…             "They~
10 Antoinette Barnett      2 Customer loyalty and/or history ar~ "Been~
11 AC                      1 Amazon has gone to sh**             "Amaz~
12 customer                1 so I ask for a refund back to my a~ "so I~
13 Will Chen               1 Rude and stupid customer service    "If p~
14 Matthew Blevins         1 Amazon Claims They Did Not Receive~ "I us~
15 Gem                     1 Ordered puppy food Monday received… "Orde~
16 SuzyJ                   1 On August 9 2022 it will have be t~ "On A~
17 Isabelle                1 Item arrived poorly packed and dam~  NA   
18 Hannah veibel           1 no Money returned                   "I or~
19 DiConti Jenine          1 Amazon is a fraudulent company.     "Amaz~
20 Urvashi                 1 Only Buyer oriented marketplace     "Does~
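
The same behaviour can be checked offline on a toy page. This is a minimal sketch, not the live Trustpilot markup: the HTML, the class names .card and .body, and the object names are hypothetical. It also shows one way to swap the NA for a string such as "No review", which the question originally asked for.

```r
library(rvest)

# Hypothetical stand-in for the review page: the second card has no body element.
toy_page <- minimal_html('
  <div class="card"><p class="body">First review</p></div>
  <div class="card"></div>
  <div class="card"><p class="body">Third review</p></div>
')

# Select the cards first, then look for one body inside each card.
cards <- html_elements(toy_page, ".card")

review <- cards %>%
  html_element(".body") %>%  # one result per card, missing where absent
  html_text2()

length(review)  # 3, one entry per card
is.na(review)   # FALSE TRUE FALSE

# Fill the gaps with a placeholder string if NA is not wanted.
review[is.na(review)] <- "No review"
review
```

Because html_element() accepts a whole node set, the per-card lookup here needs no map(); the map() version in the answer should give the same values.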

Thank you Tom, I've run the code and it works - I will use this to move forward. — supercarp, Aug 01 '22 at 20:30
