
As part of a project, I am trying to scrape the complete reviews from Google+ (in previous attempts on other websites, my reviews were truncated by a "More" link which hides the full review unless you click on it).

I have chosen the package rvest for this. However, I do not seem to be getting the results I want.

Here are my steps:

library(rvest)
library(xml2)
library(RSelenium)

queens <- read_html("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")

# Here I use the SelectorGadget tool to identify the user review part that I wish to scrape

reviews <- queens %>%
    html_nodes(".review-snippet") %>%
    html_text()

However, this doesn't seem to be working; I do not get any output here.
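
For reference, counting the matched nodes shows the problem (a minimal check; it returns 0 for me, which suggests the reviews are injected by JavaScript after the page loads rather than present in the static source):

# sanity check: how many nodes does the selector match in the raw HTML?
# a count of 0 means the reviews are not in the static page source
queens %>%
    html_nodes(".review-snippet") %>%
    length()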

I am quite new to this package and web scraping, so any inputs on this would be greatly appreciated.

Varun

1 Answer


Here is the workflow with RSelenium and rvest:
1. Scroll down as many times as needed to load as much content as you want, pausing once in a while to let the content load.
2. Click all the "More" buttons to expand the full reviews.
3. Get the page source and use rvest to extract all the reviews into a list.

What you want to scrape is not static, so you need the help of RSelenium. This should work:

library(rvest)
library(xml2)
library(RSelenium)

rmDr <- rsDriver(browser = "chrome", chromever = "73.0.3683.68")
myclient <- rmDr$client
myclient$navigate("https://www.google.co.uk/search?q=queen%27s+hospital+romford&oq=queen%27s+hospitql+&aqs=chrome.1.69i57j0l5.5843j0j4&sourceid=chrome&ie=UTF-8#lrd=0x47d8a4ce4aaaba81:0xf1185c71ae14d00,1,,,")
# click on the snippet to switch focus ----------
webEle <- myclient$findElement(using = "css", value = ".review-snippet")
webEle$clickElement()
# simulate scrolling down several times -------------
scroll_down_times <- 20
for (i in 1:scroll_down_times) {
    webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
    # the content needs time to load; wait 1 second every 5 scroll-downs
    if (i %% 5 == 0) {
        Sys.sleep(1)
    }
}
# loop and simulate clicking on all "More" elements -------------
webEles <- myclient$findElements(using = "css", value = ".review-more-link")
for (webEle in webEles) {
    # tryCatch prevents a single failed click from stopping the loop
    tryCatch(webEle$clickElement(), error = function(e) {print(e)})
}
pagesource <- myclient$getPageSource()[[1]]
# this should get you the full reviews, including translation and original text -------------
reviews <- read_html(pagesource) %>%
    html_nodes(".review-full-text") %>%
    html_text()

#number of stars
stars <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes("g-review-stars > span") %>%
    html_attr("aria-label")


#time posted
post_time <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes(".dehysf") %>%
    html_text()
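
Once everything is scraped you can tidy up. A minimal sketch, assuming the three vectors line up one-to-one per review (worth checking their lengths before combining), using the objects created above:

# combine the scraped fields into one data frame;
# assumes reviews, stars and post_time all have the same length
review_df <- data.frame(review = reviews,
                        stars = stars,
                        posted = post_time,
                        stringsAsFactors = FALSE)

# close the browser and stop the Selenium server when finished
myclient$close()
rmDr$server$stop()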
yusuzech
  • On running rmDr=rsDriver(browser = "chrome") I seem to be getting an error which says 'Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused' – Varun May 05 '18 at 13:36
  • It happens sometimes: port `4567L` may have been taken by another application, so you can try other ports like `4444L` or `4445L` (see the sketch after these comments). – yusuzech May 05 '18 at 16:56
  • Now on running `pagesource= myclient$getPageSource()[[1]]` I get another error message that says `chrome not reachable\n (Session info: chrome=xxxxxx)`. Any idea how to get around this error? – Varun May 06 '18 at 09:56
  • I don't get this error. What version of R are you using right now? Recently, some of RSelenium's dependencies became unavailable due to the 3.5 update. If you are using the newest version, you can switch to an older one first. – yusuzech May 07 '18 at 08:14
  • I'm running on `R version 3.4.1 (2017-06-30)`, so technically I shouldn't be facing this issue – Varun May 07 '18 at 19:14
  • Can you see if this solves your problem: https://stackoverflow.com/questions/45688020/chrome-not-reachable-selenium-webdriver-error – yusuzech May 07 '18 at 19:39
  • Thanks, I figured out a workaround. However, when I scrape the reviews, they seem to be translated and I only get 9 results (there are many more). Any suggestions on how I can scrape all the reviews (in their original language)? – Varun May 07 '18 at 21:54
  • Also, the reviews seem to be truncated after a point. Do you know how I can get the full reviews? – Varun May 07 '18 at 21:57
  • You can simulate pressing Page Down to load more items; I edited my code, so you can check that. As for the language problem, you can try to set it in the browser instead of in R (see the sketch after these comments); since I cannot simulate your working environment, I cannot replicate your error. – yusuzech May 08 '18 at 00:58
  • This is great. As a final request, is it possible to simulate the "click on More" link for each review, which expands the review so the full thing appears? It seems that lengthy reviews get truncated because the `More` needs to be selected. – Varun May 08 '18 at 06:29
  • I just updated my answer. It should meet all your requests. – yusuzech May 08 '18 at 18:10
  • Hello Yifu Yan, I've been meaning to add the user rating (number of stars) for each review, as well as the date of posting (e.g. `3 months ago`). I've tried to include additional chunks in the rvest part after `html_text`, but it doesn't seem to be working. If you could help with this last request, it would be awesome. Thanks a lot. Let me know if you'd like me to edit the original question. – Varun May 13 '18 at 15:54
  • I continued to use RSelenium. I just updated my answer. – yusuzech May 13 '18 at 19:17
  • Thanks again! I'm curious to know how you find the values for `html_node`? I use the SelectorGadget plugin, but it does not give me this information. If you have any reading material on this, it would be very helpful. Also, can I use this methodology for other forms of web-scraping too (Tripadvisor, Amazon etc..)? – Varun May 14 '18 at 11:43
  • SelectorGadget is very convenient, but it may not give you the optimal CSS selector. While using Chrome, you can right-click anything on the website and click Inspect, or press F12 and find elements using keyword search in the DOM tree. Yes, you can use the same methodology. – yusuzech May 15 '18 at 18:36
  • Hello Yifu Yan, I have recently posted a question based on the one you answered here. I was hoping you could help; here is the link: https://stackoverflow.com/questions/50680985/r-scrape-a-list-of-google-urls-using-purrr-package – Varun Jun 04 '18 at 12:49
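
To address the port clash and the auto-translation issues raised in the comments above, here is a minimal sketch of starting the driver on an alternative port with a fixed browser language. Treat the `chromeOptions`/`--lang` part as an assumption about how Chrome handles its locale, and match `chromever` to your installed Chrome:

library(RSelenium)

# start the server on 4444L in case the default 4567L is taken, and ask
# Chrome for a fixed UI language so Google is less likely to auto-translate
# reviews (--lang is a Chrome startup flag; this behaviour is an assumption)
rmDr <- rsDriver(
    browser = "chrome",
    port = 4444L,
    chromever = "73.0.3683.68",
    extraCapabilities = list(
        chromeOptions = list(args = list("--lang=en"))
    )
)
myclient <- rmDr$client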