
I want to pull some basic information from Google Scholar, such as Title_name, Author_Names, Year_Publication, Title_URL, and cited_by, across all Google Scholar result pages, but as a test I wanted to extract information from just 2 pages.

The purpose of this web scraping is to generate a list of studies for a literature review leading to a meta-analysis.

I have been trying to edit the following code, but with no luck:

# Install and load the necessary packages
#install.packages("RSelenium")
##install.packages("rvest")
#install.packages("stringr")

library(RSelenium)
library(rvest)
library(stringr)

# Start a Selenium server and open Chrome browser

rD <- rsDriver(browser = "chrome", chromever = "latest", geckover = "latest", 
               IEDriverVersion = NULL, verbose = FALSE, check = TRUE, 
               extraCapabilities = NULL, verboseInfo = FALSE, checkInterval = 1000, 
               timeout = 20000, whitelist = NULL, checkPath = TRUE, port = 4445L, 
               phantomver = NULL, 
               chromepath = NULL, firefoxpath = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe")


remDr <- rD$client

# Define your search terms
search_terms <- "((COVID OR COVID-19))"

# Function to extract data from a page
extract_data <- function(page_source) {
  page <- read_html(page_source)
  titles <- page %>% html_nodes(".gs_rt") %>% html_text()
  authors <- page %>% html_nodes(".gs_a") %>% html_text()
  years <- str_extract(authors, "\\d{4}")
  authors <- str_replace(authors, "\\d{4}", "")
  urls <- page %>% html_nodes(".gs_rt a") %>% html_attr("href")
  cited_by <- page %>% html_nodes(".gs_fl a:nth-child(3)") %>% html_text()
  cited_by <- as.integer(str_extract(cited_by, "\\d+"))
  
  data.frame(Title_name = titles, Author_Names = authors, Year_Publication = years, Title_URL = urls, cited_by = cited_by)
}


# Function to search for a specific term on Google Scholar
search_google_scholar <- function(term) {
  tryCatch({
    remDr$navigate("https://scholar.google.com/")
    search_box <- remDr$findElement("css", "#gs_hdr_tsi")
    search_box$sendKeysToElement(list(term, key="enter"))
    Sys.sleep(5) # Allow time for page to load
    
    pages <- 2 # Number of pages to scrape 
    results <- data.frame()
    
    for (page in 1:pages) {
      page_source <- remDr$getPageSource()[[1]]
      page_data <- extract_data(page_source)
      results <- rbind(results, page_data)
      
      next_button <- remDr$findElement("css", "#gs_n a")
      if (length(next_button) == 0) {
        break
      } else {
        next_button$clickElement()
        Sys.sleep(5) # Allow time for page to load
      }
    }
    
    return(results)
  }, error = function(e) {
    message("An error occurred: ", conditionMessage(e))
    NULL
  })
}

# Execute the search and scrape the data
search_results <- search_google_scholar(search_terms)

# Close the browser
remDr$close()
rD$server$stop()

Can anyone help me modify the above code or suggest a simple workaround?

  • You might want to try out https://pypi.org/project/scholarly/ using https://cran.r-project.org/web/packages/reticulate/vignettes/calling_python.html (see the sketch after these comments). – Mark Jul 21 '23 at 04:32
  • Great question! I am working on a related question over here: https://stackoverflow.com/questions/76701351/html-xml-understanding-how-scroll-bars-work – stats_noob Jul 21 '23 at 04:46
  • Can you expand on the problem you run into? – Ric Jul 21 '23 at 05:28
  • Related: https://stackoverflow.com/a/71236400/12957340 – jared_mamrot Jul 21 '23 at 05:32
  • @stats_noob I got close enough. Maybe it would help you as well. Please take a look. – Prajwal Mani Pradhan Jul 21 '23 at 17:37
  • @Ric This was my first time doing it, so I was unaware of what to expect. Now that I have got a few results, I know the initial problem was that R was unable to locate the chromedriver executable; I had to download it from a different website. – Prajwal Mani Pradhan Jul 21 '23 at 17:39
  • @Prajwal: thank you for your reply! Do you have any ideas about my question? – stats_noob Jul 21 '23 at 17:52
  • @stats_noob: I am also lost on the scrolling piece. Based on the ongoing discussion there, I hope you will arrive at a working solution soon. I will be following your question. Good luck! These are indeed good problems to solve! :) – Prajwal Mani Pradhan Jul 24 '23 at 23:01
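For anyone who wants to try the scholarly route suggested in the first comment, here is a minimal sketch. It assumes Python and the scholarly package are installed in the environment that reticulate picks up; the field names follow the scholarly documentation and should be checked against the actual return values.

library(reticulate)

# Import the Python module; it exposes a 'scholarly' object with the search helpers
sch <- import("scholarly")

# search_pubs() returns a Python generator of publication records
pubs <- sch$scholarly$search_pubs("COVID OR COVID-19")

# Pull one record and inspect the fields of interest
first <- iter_next(pubs)
first$bib$title        # title
first$bib$author       # authors
first$bib$pub_year     # year of publication
first$num_citations    # cited-by count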

1 Answer


I figured out the problem. Google Chrome was not launching because I was missing ChromeDriver. I downloaded it from here: https://googlechromelabs.github.io/chrome-for-testing/#stable (I matched the ChromeDriver version listed there to the Chrome version I have installed; I had to upgrade my Chrome version first).
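As a quick sanity check (a sketch assuming the default wdman/binman setup that rsDriver() uses to manage drivers), you can list the chromedriver builds already cached locally and, if none matches your Chrome version, request a specific one:

# Chromedriver versions that wdman/binman have cached locally
binman::list_versions("chromedriver")

# Optionally pin rsDriver() to a specific cached build matching your Chrome version
# rD <- rsDriver(browser = "chrome", chromever = "115.0.5790.102")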

I also edited my code. Here is the working version of the code for the first two pages of Google Scholar.

# Load the necessary packages
library(RSelenium)
library(rvest)
library(stringr)
# Specify the path to the chromedriver executable
#chromedriver_path <- "C:/Program Files/Google/Chrome/Application/115.0.5790.102/chromedriver.exe"  # Replace with the actual path

# Start the Selenium server on a different port (e.g., 5555)
#rD <- rsDriver(browser = "chrome", port = 5555)

# Specify the path to the chromedriver executable
chromedriver_path <- "C:/Program Files/Google/Chrome/Application/115.0.5790.102/chromedriver.exe"  # Replace with the actual path

# Record the chromedriver path in an environment variable
# (note: rsDriver() normally locates chromedriver itself via wdman, so this step may be redundant)
Sys.setenv(CHROMEDRIVER_PATH = chromedriver_path)

# Start the Selenium server and open the Chrome browser
rD <- rsDriver(browser = "chrome")

# Get the remote driver (remDr) object
remDr <- rD[["client"]]


# Define your search terms
search_terms <- "((COVID OR COVID-19))"

# Get the remote driver (remDr) object
#remDr <- rD$client

# Set the additional capabilities with the chromedriver path
#extra_capabilities <- list(chromedriverExecutable = chromedriver_path)
#remDr$setCapabilities(extra_capabilities)



# Function to extract data from a page
extract_data <- function(page_source) {
  page <- read_html(page_source)
  titles <- page %>% html_nodes(".gs_rt") %>% html_text()
  authors <- page %>% html_nodes(".gs_a") %>% html_text()
  years <- str_extract(authors, "\\d{4}")
  authors <- str_replace(authors, "\\d{4}", "")
  urls <- page %>% html_nodes(".gs_rt a") %>% html_attr("href")
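  # Caution: this selector assumes every result has a 'Cited by' link in the same
  # position; if the extracted vectors end up with different lengths, the
  # data.frame() call below will error.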
  cited_by <- page %>% html_nodes(".gs_fl a:nth-child(3)") %>% html_text()
  cited_by <- as.integer(str_extract(cited_by, "\\d+"))
  
  data.frame(Title_name = titles, Author_Names = authors, Year_Publication = years, Title_URL = urls, cited_by = cited_by)
}

# Function to search for a specific term on Google Scholar
search_google_scholar <- function(term) {
  tryCatch({
    remDr$navigate("https://scholar.google.com/")
    search_box <- remDr$findElement("css", "#gs_hdr_tsi")
    search_box$sendKeysToElement(list(term, key="enter"))
    Sys.sleep(5) # Allow time for page to load
    
    pages <- 2 # Number of pages to scrape 
    results <- data.frame()
    
    for (page in 1:pages) {
      page_source <- remDr$getPageSource()[[1]]
      page_data <- extract_data(page_source)
      results <- rbind(results, page_data)
      
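      # Note: findElement() throws an error when nothing matches the selector, so the
      # length() check below is never reached; a missing 'next' link is instead caught
      # by the outer tryCatch(), which returns NULL.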
      next_button <- remDr$findElement("css", "#gs_n a")
      if (length(next_button) == 0) {
        break
      } else {
        next_button$clickElement()
        Sys.sleep(5) # Allow time for page to load
      }
    }
    
    return(results)
  }, error = function(e) {
    message("An error occurred: ", conditionMessage(e))
    NULL
  })
}

# Execute the search and scrape the data
search_results <- search_google_scholar(search_terms)

# Close the browser
remDr$close()

# Stop the Selenium server
rD$server$stop()

Of course, the web scraping is not perfect for many columns. Artifacts such as "HTML" and "PDF" markers and the like end up in the extracted fields. Therefore, I need to preprocess the data further to automate my literature review, but this gets me close enough. Hopefully it helps future researchers.
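As one possible starting point for that preprocessing, here is a rough sketch; the column names match the data frame built above, but the patterns are guesses and will likely need tuning against the actual output.

library(dplyr)
library(stringr)

clean_results <- search_results %>%
  mutate(
    # Strip leading "[HTML]", "[PDF]", "[BOOK]", "[CITATION]" tags from titles
    Title_name = str_remove(Title_name, "^\\[(HTML|PDF|BOOK|CITATION)\\]\\s*"),
    # Keep only the author portion of the ".gs_a" line (text before the first " - ")
    Author_Names = str_trim(str_extract(Author_Names, "^[^-]+")),
    # Treat missing cited-by counts as zero
    cited_by = coalesce(cited_by, 0L)
  )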