I am working on a web scraping project that aims to extract Google+ reviews from a set of children's hospitals. My methodology is as follows:

1) Define a list of Google+ URLs to navigate to for review scraping. The URLs are stored in a dataframe along with other variables describing each hospital.

2) Scrape the review text, star rating, and post time for all reviews at a given URL.

3) Save these elements in a dataframe, and name the dataframe after the value of another variable (in the same dataframe) corresponding to the URL.

4) Move on to the next URL, and so on until all URLs are scraped (a minimal skeleton of this workflow is sketched below).
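
To make the intended flow concrete, here is a minimal, hypothetical skeleton of that loop (scrape_one_url() is just a placeholder for the scraping function defined below, and urls_df is the URL dataframe):

#Hypothetical skeleton only; scrape_one_url() is a placeholder, not my real function
results = list()
for (i in seq_len(nrow(urls_df))) {
  url = urls_df$GooglePlus_URL[i]        #URL to navigate to
  name = tolower(urls_df$Name[i])        #name for the resulting dataframe
  results[[name]] = scrape_one_url(url)  #store one dataframe per hospital
}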

Currently, the code is able to scrape a single URL. I have tried to wrap it in a function and apply it with map from the purrr package, but it doesn't seem to be working; I must be doing something wrong.

Here is my attempt, with comments explaining the purpose of each step:

#Install the development version of RSelenium and load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid SSL error messages
library(httr)
set_config(config(ssl_verifypeer = 0L))

Defining the URL dataframe

#Now define the dataframe with the URLs
urls_df = data.frame(Name = c("CHKD", "AIDHC"),
                     ID = c("AAWZ12", "AAWZ13"),
                     GooglePlus_URL = c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
                                        "https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"),
                     stringsAsFactors = FALSE) #keep the URLs as character strings, not factors

Creating the function

extract_google_review = function(googleplus_urls) {

  #Opens a Chrome session
  rmDr = rsDriver(browser = "chrome", check = FALSE)
  myclient = rmDr$client

  #Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
  urls_df_sub = urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)

  #Navigate to the URL
  myclient$navigate(googleplus_urls)

  #click on the snippet to switch focus----------
  webEle <- myclient$findElement(using = "css", value = ".review-snippet")
  webEle$clickElement()
  #Save the page source
  pagesource = myclient$getPageSource()[[1]]

  #Simulate scrolling down several times-------------
  count = read_html(pagesource) %>%
    html_nodes(".p13zmc") %>%
    html_text()

  #Stores the number of reviews for the URL, so we know how many times to scroll down
  scroll_down_times = count %>%
    str_sub(1, nchar(count) - 5) %>%
    as.numeric()

  for (i in 1:scroll_down_times) {
    webEle$sendKeysToActiveElement(sendKeys = list(key = "page_down"))
    #the content needs time to load; wait 1.2 seconds every 5 scroll-downs
    if (i %% 5 == 0) {
      Sys.sleep(1.2)
    }
  }

  #loop and simulate clicking on all "more" links to expand truncated reviews-------------
  webEles <- myclient$findElements(using = "css", value = ".review-more-link")
  for (webEle in webEles) {
    tryCatch(webEle$clickElement(), error = function(e) {print(e)})
  }

  pagesource = myclient$getPageSource()[[1]]
  #this should get the full review, including translation and original text
  reviews = read_html(pagesource) %>%
    html_nodes(".review-full-text") %>%
    html_text()

  #number of stars
  stars <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes("g-review-stars > span") %>%
    html_attr("aria-label")

  #time posted
  post_time <- read_html(pagesource) %>%
    html_node(".review-dialog-list") %>%
    html_nodes(".dehysf") %>%
    html_text()

  #Consolidating everything into a dataframe, truncating all vectors to the shortest length
  n = min(length(reviews), length(stars), length(post_time))
  reviews = head(reviews, n)
  stars = head(stars, n)
  post_time = head(post_time, n)
  reviews_df = data.frame(review = reviews, rating = stars, time = post_time)

  #Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
  df_name <- tolower(urls_df_sub$Name)

  if(exists(df_name)) {
    assign(df_name, unique(rbind(get(df_name), reviews_df)))
  } else {
    assign(df_name, reviews_df)
  }


} #End function
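
For reference, calling the function on a single URL looks like this:

#Calling the function on a single URL
extract_google_review(urls_df$GooglePlus_URL[1])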

Feeding the URLs into the function

#Now that the function is defined, create a vector of URLs and feed it into the function
googleplus_urls = urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
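
One alternative I have been wondering about (an untested sketch; it assumes the function is changed to return reviews_df at the end instead of using assign) is to collect the results into a named list:

#Untested sketch: assumes extract_google_review is modified to return reviews_df
#set_names comes from purrr; names are taken from the Name column of urls_df
results <- urls_df$GooglePlus_URL %>%
  map(extract_google_review) %>%
  set_names(tolower(urls_df$Name))
#results$chkd and results$aidhc would then hold the two dataframes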

There seems to be an error in the function, which is preventing it from scraping and storing the data into separate dataframes as intended.

My Intended Output

Two dataframes, one per hospital, each with 3 columns (review, rating, and time).
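
To illustrate the intended structure, here is a hypothetical empty schema (not real scraped data):

#Hypothetical schema of each intended dataframe, one per hospital
chkd = data.frame(review = character(),  #full review text
                  rating = character(),  #star rating from the aria-label attribute
                  time = character())    #post time text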

Any pointers on how this can be improved will be greatly appreciated.

Varun
  • You'd be better off using a for loop instead of map, because you need to open additional windows for web scraping using Selenium. You can check the Selenium parallel scraping guide. Currently, I don't think there is an R version though. – yusuzech Jun 06 '18 at 17:09
  • OK. So this is not possible in R at the moment? – Varun Jun 06 '18 at 17:11
  • Check this: https://stackoverflow.com/questions/38950958/run-rselenium-in-parallel?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – yusuzech Jun 06 '18 at 17:14
  • I think I'm wrong, it's possible to do it with other packages. The link above should work. – yusuzech Jun 06 '18 at 17:15
  • I tried using `foreach` as you suggested from the link above, but it still doesn't seem to work – Varun Jun 06 '18 at 18:45
  • The error probably comes from the part where each dataframe is named based on the corresponding value in the `Name` column in `urls_df` – Varun Jun 06 '18 at 18:48
  • It could be. If all other methods don't work, you can still use a for loop. – yusuzech Jun 06 '18 at 21:39
  • Hello Yifu Yan, I was looking to add another element to the dataframe when scraping the reviews: I'd like to include the replies posted to a review (if any). Could you please help identify the HTML snippet that corresponds to this block? I tried right-clicking and selecting 'Inspect', but I cannot find the right snippet. Thanks – Varun Aug 20 '18 at 16:21
