I'm using a script in RStudio to scrape Goodreads reviews. Long reviews are only partially visible; you have to click on "...more" to see the rest of the review.
The script handles this by "clicking" on "...more" (where it occurs), so the entire review is scraped:
#Expand all reviews
expandMore <- remDr$findElements("link text", "...more")
sapply(expandMore, function(x) x$clickElement())
#Extracting the reviews from the page
reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
reviews.text <- unlist(reviews.list)
The problem is that part of each review is now scraped twice: first the preview, then the entire review. So what I get is a CSV file containing reviews like this (a short example I wrote myself):
I really liked the book because of many reasons which will
I really liked the book because of many reasons which will be explained in this long review. As you can see the first part of my review is repeated.
I want the script to scrape only the full review after "...more" and to ignore or dismiss the preview, while still scraping the short reviews (those without "...more") as they are. So basically I want it to only look at the text after "...more" IF "...more" is present.
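In pseudologic, what I'm after is something like the following rough sketch (the snippets and structure here are made up; the real Goodreads markup may differ):

```r
library(rvest)

# Made-up review snippets for illustration only
expanded <- "<div><span>preview that gets repeated</span><span>the full review text</span></div>"
short    <- "<div><span>a short review</span></div>"

# Take the last <span>: the full text if "...more" was expanded,
# or the only <span> for short reviews
pick_full <- function(html) {
  read_html(html) %>%
    html_node("span:last-child") %>%
    html_text()
}

pick_full(expanded)  # "the full review text"
pick_full(short)     # "a short review"
```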
I've tried to do this myself by using html_nodes("span:last-child"), exchanging the line
reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
for
reviews.list <- lapply(reviews.html, function(x){ read_html(x) %>% html_nodes("span:last-child") %>% html_text() } )
This causes the following error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
I guess there is a clash between html_nodes and the span:last-child selector? Is there an alternative I could use instead of html_nodes("span:last-child"), or is there a way to adapt this line to prevent the error?
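I may be misreading the error, but it looks like the one you get when html_nodes() is applied to a raw character string rather than a parsed document. A minimal example (with a made-up snippet):

```r
library(rvest)

x <- "<span>preview</span><span>full review</span>"

# Applying html_nodes() directly to the character string reproduces the error:
# x %>% html_nodes("span:last-child")
# Error in UseMethod("xml_find_all") :
#   no applicable method for 'xml_find_all' applied to an object of class "character"

# Parsing first works:
read_html(x) %>% html_nodes("span:last-child") %>% html_text()  # "full review"
```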
Thank you in advance!