Webscraping the data of a graph using python

Question

I want to scrape the data of a graph that can be found on this webpage. For this purpose, I am using Selenium in Python (PyCharm). So far this is my code:

from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
Researcher=driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""") .click()
Graph=driver.find_elements_by_id("gsc_md_hist_b")
print(Graph.text)

The code works fine until it has to take the information (years and citations per year) from the graph; the reply is that there is no text to scrape. Could you give me some ideas on how I can scrape the information I need?

Many thanks in advance, Iván

You could also look directly for `<span>`s of class `.gsc_g_t` for the years, while the citation counts are in ` `. — Asmus, Jul 20 '20 at 07:44

Ashish Karn · Answer 1 · 2020-07-20T12:10:01.190

You can try an XPath that matches on the class attribute and fetch the text of all matching spans as a list. Please check the untested code below:

from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
# Click the citations graph (same XPath as in the question)
driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""").click()

# Year labels: <span class="gsc_g_t">
Graph=driver.find_elements_by_xpath("//span[@class='gsc_g_t']")
for spanText in Graph:
    print(spanText.text)

# Bar values (citation counts): <span class="gsc_g_al">
BarValue=driver.find_elements_by_xpath("//span[@class='gsc_g_al']")
for barValueText in BarValue:
    print(barValueText.text)

Many thanks, Ashish Karn! Do you know how I can scrape the information on the bars (number of citations)? I am struggling to scrape this info. Many thanks in advance, Iván — Iván, Jul 20 '20 at 11:58
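
One possible reason the bar values print as empty strings: Selenium's `.text` only returns text that is rendered as visible, and the count labels may be hidden until a bar is hovered. A minimal sketch, assuming the counts sit in the `gsc_g_al` spans used above, reads the `textContent` attribute instead, which does not depend on visibility. `hidden_texts` and `FakeElement` are hypothetical names; `FakeElement` is only a stand-in so the snippet runs without a browser.

```python
def hidden_texts(elements):
    """Return the stripped textContent of each element, visible or not."""
    return [(el.get_attribute("textContent") or "").strip() for el in elements]


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text_content, displayed=False):
        self._text_content = text_content
        self._displayed = displayed

    @property
    def text(self):
        # Mimics Selenium: elements that are not displayed yield ""
        return self._text_content.strip() if self._displayed else ""

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


bars = [FakeElement(" 12 "), FakeElement("30")]
print([b.text for b in bars])   # what .text gives for hidden labels
print(hidden_texts(bars))       # what textContent gives
```

With a real driver, the list passed in would be the `BarValue` list from the code above, i.e. `hidden_texts(BarValue)`.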

score 0 · Answer 2 · answered Jul 20 '20 at 09:52

To extract the years you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:

Using XPATH:

driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='gsc_rsb_cit']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='gsc_md_hist_c']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']//span[@class='gsc_g_t']")))])

Console Output:

['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Many thanks @DebanjanB! Actually, I scraped the years, but I am having problems with scraping the information of the bars (number of citations). Do you have some advice to achieve this? Again, many thanks Iván — Iván, Jul 20 '20 at 11:56
@Iván This answer was constructed as per your code trials. Yes, I do have a solution for _number of citations_ as well, but I'm afraid that for that you have to raise a new question along with your code trials. Hint: You have to _mouse hover_. — undetected Selenium, Jul 20 '20 at 12:00
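
As an alternative to hovering, the page's HTML could be parsed directly, since hidden labels are still part of the markup. The sketch below is not taken from either answer: it uses Python's standard-library `html.parser` on a small hand-written sample, assuming the years are in `gsc_g_t` spans and the counts in `gsc_g_al` spans as in the XPaths above. `SpanTextCollector` and `sample_html` are hypothetical; with Selenium, `driver.page_source` would be fed to the parser instead.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text of <span> elements, grouped by class attribute."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.found = {name: [] for name in wanted_classes}
        self._current = None  # class of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in self.wanted:
                self._current = cls
                self.found[cls].append("")

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current][-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


# Hypothetical sample markup, made up for illustration (not real page data)
sample_html = (
    '<div id="gsc_md_hist_b">'
    '<span class="gsc_g_t">2019</span><span class="gsc_g_t">2020</span>'
    '<span class="gsc_g_al">41</span><span class="gsc_g_al">17</span>'
    '</div>'
)

parser = SpanTextCollector(["gsc_g_t", "gsc_g_al"])
parser.feed(sample_html)
years = parser.found["gsc_g_t"]
counts = [int(c) for c in parser.found["gsc_g_al"]]
print(years, counts)
```

Before pairing years with counts, it is worth checking that both lists have the same length, because a year without a bar would shift the alignment.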
