Using tor and python to scrape Google Scholar

Question

I'm working on a project to analyse how journal articles are cited. I have a large file of journal article names. I intend to pass them to Google Scholar and see how many citations each has.

Here is the strategy I am following:

1. Use "scholar.py" from http://www.icir.org/christian/scholar.html. This is a pre-written Python script that searches Google Scholar and returns information on the first hit in CSV format (including the number of citations).
2. Google Scholar blocks you after a certain number of searches (I have roughly 3000 article titles to query). I have found that most people use Tor to get around this (see "How to make urllib2 requests through Tor in Python?" and "Prevent Custom Web Crawler from being blocked"). Tor is a service that gives you a random IP address every few minutes.

I have scholar.py and Tor both set up and working. I'm not very familiar with Python or the urllib2 library, and I wonder what modifications are needed to scholar.py so that its queries are routed through Tor.

I am also open to suggestions for an easier (and potentially quite different) approach to mass Google Scholar queries, if one exists.

Thanks in advance

Paulo Scardine · Answer 1 · 2017-09-12T07:23:23.787

For me, the best way to use Tor is to set up a local HTTP proxy such as polipo in front of it. I like to clone the repo and compile locally:

git clone https://github.com/jech/polipo.git
cd polipo
make all
make install

You can also install it with your package manager (brew install polipo on macOS, apt install polipo on Ubuntu). Then write a simple config file that points polipo at Tor's SOCKS port:

echo socksParentProxy=localhost:9050 > ~/.polipo
echo diskCacheRoot='""' >> ~/.polipo
echo disableLocalInterface=true >> ~/.polipo

Then run it:

polipo

See the urllib docs on how to use a proxy. Like many Unix applications, urllib honors the http_proxy and https_proxy environment variables, so set them in the shell you will run Python from:

export http_proxy="http://localhost:8123"
export https_proxy="http://localhost:8123"
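If you would rather set the proxy in code than in the environment, a minimal sketch along these lines may help. It uses Python 3's urllib.request (the successor to urllib2) and assumes polipo is listening on localhost:8123 as above. The names build_tor_opener and fetch are made up for illustration; if scholar.py builds its own opener internally, the ProxyHandler would have to be added there instead, which is an assumption about its internals.

```python
import urllib.request

# Assumption: polipo is running locally on the port used in the exports above.
POLIPO_PROXY = "http://localhost:8123"


def build_tor_opener(proxy_url=POLIPO_PROXY):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, opener=None, timeout=30):
    """Fetch a URL through the proxy and return the body as text."""
    opener = opener or build_tor_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch("http://check.torproject.org/") with polipo and Tor running should return the Tor check page. Calling urllib.request.install_opener(build_tor_opener()) would make plain urlopen calls use the proxy as well.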

I like to use the requests library, a nicer wrapper for urllib. If you don't have it already:

pip install requests

If your requests are going through Tor, the following one-liner should print True:

python -c "import requests; print('Congratulations' in requests.get('http://check.torproject.org/').text)"
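The same check can be written as a small script that passes the proxy to requests explicitly instead of relying on environment variables. This is only a sketch: the helper names are hypothetical, and the proxy address assumes polipo on its default port as configured above.

```python
CHECK_URL = "http://check.torproject.org/"
# Assumption: polipo is running locally on the port used earlier in this answer.
POLIPO_PROXY = "http://localhost:8123"


def page_confirms_tor(page_text):
    """True if the Tor check page says the request arrived through Tor."""
    return "Congratulations" in page_text


def make_tor_session(proxy_url=POLIPO_PROXY):
    """Build a requests.Session that sends HTTP and HTTPS traffic via the proxy."""
    # Third-party library; imported here so page_confirms_tor has no dependency on it.
    import requests

    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def check_tor(timeout=30):
    """Fetch the Tor check page through the proxy and report the result."""
    response = make_tor_session().get(CHECK_URL, timeout=timeout)
    return page_confirms_tor(response.text)
```

With polipo and Tor running, print(check_tor()) should give the same answer as the one-liner. A session also reuses the proxy settings for every later request, which is handy when looping over many queries.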

One last thing: the Tor network is not a free pass for doing silly things on the Internet. Even when using it, you should not assume you are totally anonymous.

Link rot, that is why link-only answers suck... I should include the instructions in the answer; unfortunately I lack the time to do it right now, sorry. (Paulo Scardine, Sep 15 '14 at 13:12)
