The answer from andrew_reece isn't working at the moment of answering this question. Even though the h3
tag with the correct class is present in the source code, the request can still throw an error, e.g. return a CAPTCHA, because Google detected your script as automated. Print the response to see the message.
I got this after sending too many requests:
The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use,
or sending requests very quickly.
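Since the block expires shortly after the rapid requests stop, one simple mitigation before reaching for proxies is to pause for a random interval between requests. A minimal sketch (the 5-15 second bounds are just an assumption; tune them as needed):

```python
import random
import time

def throttle(min_delay=5.0, max_delay=15.0):
    """Sleep for a random interval so requests don't arrive in a rapid burst."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay
```

Call throttle() before each requests.get() so consecutive requests are spaced apart.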
The first thing you can do is to add proxies to your request:
import os

# https://docs.python-requests.org/en/master/user/advanced/#proxies
proxies = {
    'http': os.getenv('HTTP_PROXY')  # or just type your proxy here without os.getenv()
}
The request code will then look like this:
html = requests.get('google scholar link', headers=headers, proxies=proxies).text
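Putting it together, here is a self-contained sketch of how the headers and proxies fit in. The user-agent string below is just an example, and the request is only prepared, not sent, so you can inspect what gets attached:

```python
import os
import requests

# https://docs.python-requests.org/en/master/user/advanced/#proxies
proxies = {
    'http': os.getenv('HTTP_PROXY'),  # or just type your proxy here without os.getenv()
}

# A browser-like user-agent makes the request look less like a script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

# Prepare the request without sending it, just to inspect the final URL and headers;
# proxies are passed at send time, e.g. requests.get(url, headers=headers, proxies=proxies)
req = requests.Request(
    'GET',
    'https://scholar.google.com/scholar',
    params={'q': 'vicia faba', 'hl': 'en'},
    headers=headers,
).prepare()

print(req.url)
```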
Alternatively, you can make it work without proxies by using requests-HTML, selenium, or pyppeteer and just rendering the page.
Code:
# If you get an empty array, it means you got a CAPTCHA.
from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# Container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)

    # absolute_links is a set of URLs; next(iter(...)) extracts a single string from it
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent=2, ensure_ascii=False))
Part of the output:
[
{
"title": "Faba bean (Vicia faba L.)",
"url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
},
{
"title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
"url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
}
]
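The next(iter(...)) call in the loop above just pulls a single URL string out of result.absolute_links, which behaves like a set. A small illustration (the link is one from the output above):

```python
# absolute_links behaves like a set of URL strings
links = {'https://www.sciencedirect.com/science/article/pii/S0378429097000257'}

# iter() gives an iterator over the set; next() pulls out one element
url = next(iter(links))
print(url)
```

Without next() and iter() you would be appending the whole set instead of a single URL.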
Essentially, you can do the same with the Google Scholar API from SerpApi, but without having to render the page or use browser automation such as selenium to get data from Google Scholar. You get instant JSON output, which is faster than selenium or requests-html, without thinking about how to bypass Google blocking.
It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.
Code to integrate:
from serpapi import GoogleSearch
import json

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google_scholar",
    "q": "vicia faba",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results_data, indent=2, ensure_ascii=False))
Part of the output:
[
{
"title": "Faba bean (Vicia faba L.)",
"url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
},
{
"title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
"url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
}
]
Disclaimer: I work for SerpApi.