17

I'd like to use Python to scrape Google Scholar search results. I found two different scripts to do that: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).

Now, I should maybe say that I'm totally new to Python, so sorry if I miss the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get the following result:

query() takes at least 2 arguments (1 given).

Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given).

This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a list of function errors. A colleague advised me to import BeautifulSoup and this before running the query, but that doesn't change the problem either. Any suggestions on how to solve the problem?

I found code for R (see link) as a solution but got quickly blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!

SashaZd
Flow

7 Answers

16

I suggest that you not use libraries written for crawling specific websites, but instead use general-purpose HTML libraries that are well tested and well documented, such as BeautifulSoup.

To access websites while presenting browser-like information, you could use a URL opener class with a custom user agent:

from urllib import FancyURLopener  # Python 2; in Python 3 this is urllib.request.FancyURLopener (deprecated)

class MyOpener(FancyURLopener):
    # Pretend to be a regular desktop browser.
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open

Then download the required URL as follows:

openurl(url).read()

To retrieve Scholar results, just use the URL http://scholar.google.se/scholar?hl=en&q=${query}.

To extract pieces of information from a retrieved HTML file, you could use this piece of code:

from bs4 import SoupStrainer, BeautifulSoup

# Parse only the div with id="gs_ab_md" (the results summary bar).
page = BeautifulSoup(openurl(url).read(), 'html.parser',
                     parse_only=SoupStrainer('div', id='gs_ab_md'))

This piece of code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
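
Putting the pieces together, here is a minimal end-to-end sketch of the same approach, written for Python 3 (urllib.request.Request with a User-Agent header takes the place of the Python 2 FancyURLopener above); the query string is only a placeholder:

from urllib.parse import quote_plus
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup, SoupStrainer

query = 'my query'  # placeholder query
url = 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)

# Send a browser-like User-Agent so the request is not flagged immediately.
request = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'})
html = urlopen(request).read()

# Parse only the div that holds the "About n results" text and print it.
page = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('div', id='gs_ab_md'))
print(page.get_text(strip=True))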

Julia
9

Google will block you... as it will be apparent that you aren't a browser. Namely, they will detect the same request signature occurring too frequently compared with reasonable human activity.

You can do:


Edit 2020:

You might want to check out scholarly:

>>> from scholarly import scholarly
>>> search_query = scholarly.search_author('Marty Banks, Berkeley')
>>> print(next(search_query))
{'_filled': False,
 'affiliation': 'Professor of Vision Science, UC Berkeley',
 'citedby': 17758,
 'email': '@berkeley.edu',
 'id': 'Smr99uEAAAAJ',
 'interests': ['vision science', 'psychology', 'human factors', 'neuroscience'],
 'name': 'Martin Banks',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=Smr99uEAAAAJ'}
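
Since each result prints as a plain dict (as in the output above), individual fields can be read directly once you pull an item from the generator. A small sketch, assuming a recent scholarly release where search results behave like the dict shown:

>>> from scholarly import scholarly
>>> author = next(scholarly.search_author('Marty Banks, Berkeley'))  # first item from the generator
>>> author['citedby']  # read a single field from the returned dict
17758
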
0x90
  • I am trying to get a single page: `requests.get("https://scholar.google.com/scholar?q=compressed+differential+heuristic")` and still get `` – AlwaysLearning Dec 28 '16 at 19:22
  • @AlwaysLearning, Thank you for supporting my initial claim. – 0x90 Dec 28 '16 at 20:23
  • @0x90 Scholarly also behaves the same for too many requests. – Naila Akbar Jun 16 '20 at 18:53
  • Unfortunately, [there's no official Google Scholar API](https://academia.stackexchange.com/a/34973), with that said, the third link in the unordered list is dead. As an alternative that can scale to enterprise-level, there's a [Google Scholar API](https://serpapi.com/google-scholar-api) from SerpApi that supports [organic](https://serpapi.com/google-scholar-organic-results), [cite](https://serpapi.com/google-scholar-cite-api), [profile](https://serpapi.com/google-scholar-profiles-api), [author](https://serpapi.com/google-scholar-author-api) results. – Dmitriy Zub Feb 23 '22 at 10:07
  • How do I get a single value from the returned generator? I just want to extract one field like cited by. – lizardpeter Mar 22 '22 at 22:28
4

It looks like scraping with Python and R runs into the problem where Google Scholar sees your request as a robot query due to the lack of a user-agent in the request. There is a similar question on StackExchange about downloading all PDFs linked from a web page, and the answer leads the user to wget on Unix and the BeautifulSoup package in Python.

Curl also seems to be a more promising direction.
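
In Python, the same fix boils down to sending a browser-like user-agent with the request. A minimal sketch using the requests library (the header value is just an example):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example browser UA
response = requests.get('https://scholar.google.com/scholar',
                        params={'q': 'my query', 'hl': 'en'},
                        headers=headers)
print(response.status_code)  # 200 if the request was not flagged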

y-i_guy
2

COPython looks correct but here's a bit of an explanation by example...

Consider f:

def f(a,b,c=1):
    pass

f expects values for a and b no matter what. You can leave c blank.

f(1,2)     #executes fine
f(a=1,b=2) #executes fine
f(1,c=1)   #TypeError: f() takes at least 2 arguments (2 given)

The fact that you are being blocked by Google is probably due to your user-agent settings in your header... I am unfamiliar with R but I can give you the general algorithm for fixing this:

  1. use a normal browser (firefox or whatever) to access the url while monitoring HTTP traffic (I like wireshark)
  2. take note of all headers sent in the appropriate http request
  3. try running your script and also note the headings
  4. spot the difference
  5. set your R script to use the headers you saw when examining browser traffic (see the sketch below)
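
For illustration, here is a Python sketch of step 5 (the same idea carries over to R); the header values are placeholders for whatever you actually captured from the browser:

import urllib.request

# Placeholder values: substitute the headers you recorded from your browser.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ...',
    'Accept-Language': 'en-US,en;q=0.9',
}
request = urllib.request.Request('https://scholar.google.com/scholar?hl=en&q=my+query',
                                 headers=headers)
html = urllib.request.urlopen(request).read()
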
Sheena
1

Here is the call signature of query()...

def query(searchstr, outformat, allresults=False)

Thus you need to specify at least a searchstr AND an outformat, while allresults is an optional flag/argument.
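
So, assuming the signature above (and that outformat 4 corresponds to BibTeX, as the question suggests), a minimal call would look like:

import gscholar

# searchstr and outformat are required; allresults has a default and may be omitted.
results = gscholar.query("my query", 4)  # 4 = BibTeX, per the question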

Cameron Sparr
  • which appears to be contrary to their documentation, not sure what to say about that one.... – Cameron Sparr Nov 02 '12 at 18:11
  • Thanks for the answer, but I tried that already (sorry for not being clear enough), so e.g. when I go query("my query", 4, allresults=False) - 4 should be BibTex if I understand correctly - then I get the following: function query in gscholar.py at line 66 response = urllib2.urlopen(request) function urlopen in urllib2.py at line 126 return _opener.open(url, data, timeout) function open in urllib2.py at line 400 response = meth(req, response) function http_response in urllib2.py at line 513 'http', request, response, code, msg, hdrs), etc. – Flow Nov 02 '12 at 18:22
  • hmmm, sounds like you may have two separate problems then. One is getting the call signature correct (Note that the outformat is NOT an optional argument, you MUST specify it). Second is there appears that urllib2 (the standard Python lib for opening urls) is having problems with the url you've given it. – Cameron Sparr Nov 02 '12 at 20:02
1

An ideal scenario is when you have good proxies (residential proxies are best, since they allow you to choose a specific location: country, city, or even a mobile carrier) and a CAPTCHA-solving service.

Here's a code snippet to extract data from all available pages using parsel:

from parsel import Selector
import requests, json

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

params = {
    'q': 'samsung medical center seoul semiconductor element simulation x-ray fetch',
    'hl': 'en',
    'start': 0
}

# JSON data will be collected here
data = []

while True:
    html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
    selector = Selector(text=html)

    print(f'extracting {params["start"] + 10} page...')

    # Container where all needed data is located
    for result in selector.css('.gs_r.gs_or.gs_scl'):
        title = result.css('.gs_rt').xpath('normalize-space()').get()
        title_link = result.css('.gs_rt a::attr(href)').get()
        publication_info = result.css('.gs_a').xpath('normalize-space()').get()
        snippet = result.css('.gs_rs').xpath('normalize-space()').get()
        cited_by_link = result.css('.gs_or_btn.gs_nph+ a::attr(href)').get()

        data.append({
            'page_num': params['start'] + 10, # 0 -> 1 page. 70 in the output = 7th page
            'title': title,
            'title_link': title_link,
            'publication_info': publication_info,
            'snippet': snippet,
            'cited_by_link': f'https://scholar.google.com{cited_by_link}',
        })
    
    if selector.css('.gs_ico_nav_next').get():
        params['start'] += 10
    else:
        break

print(json.dumps(data, indent = 2, ensure_ascii = False))

As an alternative solution, you can use Google Scholar API from SerpApi.

It's a paid API with a free plan that bypasses blocks from Google via proxies and CAPTCHA-solving solutions. It can scale to enterprise level, and there's no need for the end user to create a parser from scratch and maintain it over time if something in the HTML changes.

Also, it supports cite, profile, and author results.

Example code to integrate to parse organic results:

import json

from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpAPi API KEY",
    "engine": "google_scholar",
    "q": "biology",
    "hl": "en"
}

search = GoogleScholarSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(json.dumps(result, indent=2))

# first organic results output:
'''
{
  "position": 0,
  "title": "The biology of mycorrhiza.",
  "result_id": "6zRLFbcxtREJ",
  "link": "https://www.cabdirect.org/cabdirect/abstract/19690600367",
  "snippet": "In the second, revised and extended, edition of this work [cf. FA 20 No. 4264], two new chapters have been added (on carbohydrate physiology physiology Subject Category \u2026",
  "publication_info": {
    "summary": "JL Harley - The biology of mycorrhiza., 1969 - cabdirect.org"
  },
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=6zRLFbcxtREJ",
    "cited_by": {
      "total": 704,
      "link": "https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=5,50&sciodt=0,50&hl=en",
      "cites_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=5%2C50&cites=1275980731835430123&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:6zRLFbcxtREJ:scholar.google.com/&scioq=biology&hl=en&as_sdt=0,50",
    "versions": {
      "total": 4,
      "link": "https://scholar.google.com/scholar?cluster=1275980731835430123&hl=en&as_sdt=0,50",
      "cluster_id": "1275980731835430123",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=0%2C50&cluster=1275980731835430123&engine=google_scholar&hl=en"
    },
    "cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:6zRLFbcxtREJ:scholar.google.com/+biology&hl=en&as_sdt=0,50"
  }
}
... other results
'''

There's also a dedicated blog post of mine at SerpApi, Scrape historic Google Scholar results using Python.

Disclaimer: I work for SerpApi.

Dmitriy Zub
0

You may want to use Greasemonkey for this task. The advantage is that Google will fail to detect you as a bot, provided you additionally keep the request frequency low. You can also watch the script working in your browser window.

You can learn to code it yourself or use a script from one of these sources.

mab