
A total newbie here in search of your wisdom (first post/question, too)! Thank you in advance for your time and patience.

I am hoping to automate scientific literature searches on Google Scholar using Selenium (via Chrome) with Python. I envision entering a topic, searching for it on Google Scholar, then visiting each article/book link in the results, extracting the abstract/summary, and printing it to the console (or saving it to a text file). This will be an easy way to determine how relevant the results are to what I'm writing.

Thus far, I am able to visit Google Scholar, enter text in the search bar, filter by date (newest to oldest), and extract each of the links in the results. I have not been able to write a loop that visits each article link and extracts the abstract (or other relevant text), since each result page may be coded differently.

Kind regards,

  • JP (Aotus_californicus)

This is my code so far:

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys


    def get_results(search_term):
        url = 'https://scholar.google.com'
        browser = webdriver.Chrome(executable_path=r'C:\Users\Aotuscalifornicus\Downloads\chromedriver_win32\chromedriver.exe')

        browser.get(url)
        searchBar = browser.find_element_by_id('gs_hdr_tsi')
        searchBar.send_keys(search_term)
        searchBar.submit()
        # "Trier par date" is the French UI label for "Sort by date"
        browser.find_element_by_link_text("Trier par date").click()
        results = []
        links = browser.find_elements_by_xpath('//h3/a')
        for link in links:
            href = link.get_attribute('href')
            print(href)
            results.append(href)

        browser.close()
        return results


    get_results('Primate thermoregulation')
  • Are you trying to (1) get the abstract as shown in the search results, or (2) follow the actual search result/link and find the abstract there? If it's (2), then yes, you need different element-finding criteria to locate and extract the text, since every site is different. – aneroid May 09 '20 at 14:54
  • Thank you, @aneroid! I was hoping for option 2 and figured it would not be as easy because of how each site is built. Another option would be to extract all of the text in each link, and then I can figure out how much of it to filter out. For example, looking at everything with a 'p' tag, or something like that. What do you think? – Aotus_parisinus May 09 '20 at 16:05
  • If you need the extracted text to be useful, then _"extract all of the text in each link"_ would be too much. Also, is just the synopsis enough for your purpose? That might give a smaller, more useful extract per link, but it still needs to be customised per site or per paper. I would recommend just using the summary as provided in the search results (see the sketch after these comments). If there's an _index_ for such papers with a synopsis/summary as written by the authors, that would be ideal...and such indexes might be normalised in their formats. You would still need to create per-index extraction rules. – aneroid May 09 '20 at 16:58
  • (I'm voting to close this question as it's not within the scope of StackOverflow.) – aneroid May 09 '20 at 16:58
  • To clarify, I am looking to write a loop that enters each link and extracts an element by tag, for example, text. That is all! – Aotus_parisinus May 10 '20 at 04:52
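A minimal sketch of the "use the summary as provided in the search results" approach suggested in the comments, assuming the snippet text on the Google Scholar results page is rendered in elements with the `gs_rs` class (that class name is an assumption and should be checked against the current page markup):

    # Sketch: read the summaries straight off the results page instead of visiting each link.
    # Assumes the snippets are rendered in elements with the 'gs_rs' class (unverified assumption).
    snippets = browser.find_elements_by_class_name('gs_rs')
    for snippet in snippets:
        print(snippet.text)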

1 Answer


With regard to your comment, and using that as the basis for my answer:

To clarify, I am looking to write a loop that enters each link and extracts an element by tag, for example, text.

Open a new window or start a new driver session to check the links in the results, then use a rule to extract the text you want. You could re-use your existing driver session if you extract all the hrefs first, or create a new tab as you get each result link (a tab-based variant is sketched after the next snippet).

for link in links:
    href = link.get_attribute('href')
    print(href)
    results.append(href)

extractor = webdriver.Chrome(executable_path=...)  # as above
for result in results:
    extractor.get(result)
    section_you_want = extractor.find_elements_by_xpath(...)  # or whichever set of rules
    # other code here

extractor.close()
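If you would rather re-use the existing `browser` session instead of starting a second driver, here is a rough sketch of the new-tab approach mentioned above, using the same Selenium API style as the question's code (opening the tab via JavaScript):

# Sketch: open each result in a new tab of the existing `browser` session,
# extract what you need, then close the tab and switch back.
original_tab = browser.current_window_handle
for result in results:
    browser.execute_script("window.open('');")             # open a blank tab
    browser.switch_to.window(browser.window_handles[-1])   # switch to the new tab
    browser.get(result)
    # ... extract the text you want here ...
    browser.close()                                         # close this tab only
    browser.switch_to.window(original_tab)                  # back to the results tab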

You can set up rules to use with the base find_element() or find_elements() finders and then iterate over them until you get a result (validate based on element presence, text length, or something else sane & useful). Each of the rules is a tuple that can be passed to the base finder function:

from selenium.webdriver.common.by import By  # see the Selenium docs for the available `By` class attributes

rules = [(By.XPATH, '//h3/p'),
         (By.ID, 'summary'),
         (By.TAG_NAME, 'div'),
         ... # etc.
]

for url in results:
    extractor.get(url)
    for rule in rules:
        elems = extractor.find_elements(*rule)  # argument unpacking
        if not elems:
            continue  # not found, try next rule
        print(elems[0].text)
        break  # stop after first successful "find"
    else:  # only executed if no rules match and `break` is never reached, or `rules` list is empty
        print('Could not find anything for url:', url)
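As mentioned above, you can also validate on text length rather than just element presence. A small sketch of that variant (the 100-character minimum is an arbitrary assumption; adjust it to whatever a plausible abstract looks like for you):

MIN_ABSTRACT_LENGTH = 100  # arbitrary threshold, adjust as needed

for url in results:
    extractor.get(url)
    abstract = None
    for rule in rules:
        elems = extractor.find_elements(*rule)
        # accept the first element whose text looks long enough to be an abstract
        candidates = [e.text for e in elems if len(e.text) >= MIN_ABSTRACT_LENGTH]
        if candidates:
            abstract = candidates[0]
            break
    if abstract:
        print(abstract)
    else:
        print('Could not find anything for url:', url)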
aneroid
  • You're welcome. And welcome to SO! If that helped, then read: [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) – aneroid May 12 '20 at 19:37