I'm using a script in RStudio to scrape Goodreads reviews. Long reviews are only partially visible; you have to click on "...more" to see the rest of the review.

The script deals with this and "clicks" on the "...more" (if it occurs), so the entire review is scraped:

  #Expand all reviews
  expandMore <- remDr$findElements("link text", "...more")
  sapply(expandMore, function(x) x$clickElement())

  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
  reviews.text <- unlist(reviews.list)

The problem is that part of the review is now scraped twice: first the preview, then the entire review. So what I get is a CSV file containing reviews like this one (a short example I wrote myself):

I really liked the book because of many reasons which will 
I really liked the book because of many reasons which will be explained in this long review. As you can see the first part of my review is repeated.

I want the script to only scrape the full review after "...more" and to ignore or dismiss the preview, but I also still want it to simply scrape the short reviews (without "...more"). So basically I want it to only look at the review after "...more" IF "...more" is present.

I've tried to do it myself by using `html_nodes("span:last-child")` and replacing the line `reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )` with `reviews.list <- lapply(reviews.html, function(x){ read_html(x) %>% html_nodes("span:last-child") %>% html_text() } )`.

This causes the following error:

Error in UseMethod("xml_find_all") : 
no applicable method for 'xml_find_all' applied to an object of class "character"

I guess there is a clash between html_nodes and span:last-child? Is there an alternative I could use instead of html_nodes("span:last-child") or is there a way to adapt this to prevent the error?
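For reference, this error usually means `html_nodes()` was handed a plain character string rather than a document parsed with `read_html()`. The selector itself is probably not the cause. The sketch below shows one possible way to keep only the last `<span>` of each review, which would be the full text when a preview is present and the only text otherwise. The HTML snippets and the helper name `extract_review` are hypothetical and only illustrate the idea; the real Goodreads markup may be structured differently.

```r
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

# Hypothetical markup (an assumption, not the real Goodreads HTML):
# a long review has a preview span followed by a full-text span,
# a short review has a single span.
reviews.html <- c(
  '<div class="review"><span>I really liked the book because</span><span>I really liked the book because of many reasons.</span><a href="#">...more</a></div>',
  '<div class="review"><span>Short review.</span></div>'
)

# Hypothetical helper: parse one review and return the text of its last span.
extract_review <- function(x) {
  spans <- read_html(x) %>% html_nodes("span")
  if (length(spans) == 0) return(NA_character_)  # no span found at all
  html_text(spans[[length(spans)]])              # last span = full review
}

reviews.text <- vapply(reviews.html, extract_review, character(1),
                       USE.NAMES = FALSE)
```

Picking the last element of the node set in R, instead of relying on `span:last-child`, avoids the case where the span is followed by another element (such as the "...more" link) and is therefore not the last child of its parent.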

Thank you in advance!

MartinH
  • Hi MartinH, does it work if you change `html_nodes()` to `html_node()`? If not, please provide a small reproducible example that we can use to help you. – Bas Jun 23 '20 at 12:22
  • Hello @Bas, thank you for replying! I've tried it by changing it to `html_node()`, but unfortunately it didn't work. It looked like it was going to but it went wrong after scraping the second page of reviews. I received the following error message: `PAGE 2 Processed - Going to next Error in if (sum(!onlyRating) > 0) { : missing value where TRUE/FALSE needed In addition: Warning message: All formats failed to parse. No formats found.` I don't know how to provide a small reproducible example (I'm new to this), but I could share the full script if that would help? Thank you for helping! – MartinH Jun 23 '20 at 15:03
  • This message indicates that there is a missing value (`NA`). I guess `onlyRating` is missing. You could include a check like `is.na(onlyRating)`, and see for which page this happens. Sharing a full script is usually not a good idea since it contains a lot of stuff not relevant to your question. Try to strip out those irrelevant parts and share the rest here. See also [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for a nice explanation on reproducible examples. – Bas Jun 23 '20 at 17:49

0 Answers