
I am trying to get information on a large number of scholarly articles as part of my research study. The number of articles is on the order of thousands. Since Google Scholar does not have an API, I am trying to scrape/crawl Scholar. Now I know that this is technically against the EULA, but I am trying to be very polite and reasonable about this. I understand that Google doesn't allow bots in order to keep traffic within reasonable limits. I started with a test batch of ~500 requests with 1 s between each request. I got blocked after about the first 100 requests. I tried several other strategies, including:

  1. Extending the pauses to ~20s and adding some random noise to them
  2. Making the pauses log-normally distributed (so that most pauses are on the order of seconds but every now and then there are longer pauses of several minutes and more)
  3. Doing long pauses (several hours) between blocks of ~100 requests (the pacing is sketched below).

I doubt that at this point my script is adding any considerable traffic over what any human would. But one way or another I always get blocked after ~100-200 requests. Does anyone know of a good strategy to overcome this (I don't care if it takes weeks, as long as it is automated)? Also, does anyone have experience dealing with Google directly and asking for permission to do something similar (for research etc.)? Is it worth trying to write to them, explain what I'm trying to do and how, and see whether I can get permission for my project? And how would I go about contacting them? Thanks!
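
For reference, a minimal sketch of the pacing described above, assuming Python (the `fetch` callable is a placeholder for whatever issues the actual request, e.g. a Selenium `driver.get`):

```python
import random
import time

def polite_pause():
    # Strategy 2: log-normally distributed pause -- median around exp(1.5) ~ 4.5 s,
    # with occasional pauses of a minute or more
    time.sleep(random.lognormvariate(1.5, 1.0))

def run(urls, fetch, block_size=100, block_pause_hours=3):
    """`fetch` is whatever issues the actual request (e.g. driver.get)."""
    for i, url in enumerate(urls, start=1):
        fetch(url)
        polite_pause()  # strategies 1-2: short, randomly jittered pauses
        if i % block_size == 0:
            # Strategy 3: a long, jittered pause between blocks of ~100 requests
            time.sleep(block_pause_hours * 3600 * random.uniform(0.8, 1.2))
```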

Peter
  • How does Microsoft's Academic Search stack up vs Google's? – Padraic Cunningham Mar 28 '16 at 20:52
  • [this adds to the discussion](https://www.quora.com/Why-doesnt-Google-have-an-official-API-for-Google-Scholar) – Noam Hacker Mar 28 '16 at 20:53
  • I hope you've set your `User-Agent` in your request headers correctly - a request that doesn't set it correctly is easily detected as a bot. :) – Akshat Mahajan Mar 28 '16 at 21:34
  • @Liongold I'm using selenium which drives an actual browser to do the requests, so the User-Agent should be taken directly from the browser. Despite this, I always get blocked. – Peter Mar 29 '16 at 19:27
  • Peter, using `Sys.sleep(runif(1, 1, 3))` in a `for()` loop is a simple solution to getting past the bot detection with RSelenium that has worked for me in the past. I am looking to do the same task as you. Can you share a MWE? – rkmorgan Apr 04 '16 at 19:56
  • @Padraic, according to Wikipedia, MS Academic search is basically abandoned. https://en.wikipedia.org/wiki/Microsoft_Academic_Search – hhk Apr 08 '16 at 16:56
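
Following up on the `User-Agent` comment above: a minimal sketch of setting it explicitly, assuming Python with Selenium and a local Chrome/chromedriver install (the UA string itself is just a placeholder):

```python
from selenium import webdriver

# Placeholder UA string; any realistic browser User-Agent works here
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

options = webdriver.ChromeOptions()
options.add_argument(f"--user-agent={UA}")

driver = webdriver.Chrome(options=options)
driver.get("https://scholar.google.com/")
print(driver.execute_script("return navigator.userAgent"))  # verify the UA took effect
driver.quit()
```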

1 Answer


Without testing, I'm still pretty sure one of the following will do the trick:

  1. Easy, but with a small chance of success:

    Delete all cookies from the site in question after every rand(0, 100) requests,
    then change your user agent, accepted language, etc., and repeat.

  2. A bit more work, but a much sturdier spider as a result:

    Send your requests through Tor, other proxies, mobile networks, etc. to mask your IP (and also apply suggestion 1 at every turn). Both suggestions are sketched below.
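
A rough sketch of what that rotation might look like, assuming Python with `requests` (plus the `requests[socks]` extra and a local Tor SOCKS proxy on port 9050 when `use_tor` is enabled); the User-Agent strings are placeholders:

```python
import random
import time
import requests

# Placeholder pool of realistic browser User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

# Assumes a local Tor SOCKS proxy on 127.0.0.1:9050
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}

def crawl(urls, use_tor=False):
    session = requests.Session()
    reset_after = random.randint(1, 100)   # suggestion 1: reset identity after a random number of requests
    for i, url in enumerate(urls, start=1):
        session.headers["User-Agent"] = random.choice(USER_AGENTS)
        session.headers["Accept-Language"] = random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.8"])
        resp = session.get(url, proxies=TOR_PROXIES if use_tor else None, timeout=30)
        yield resp
        if i % reset_after == 0:
            session.cookies.clear()         # drop cookies so the next request looks like a new visitor
            reset_after = random.randint(1, 100)
        time.sleep(random.uniform(5, 20))   # keep the polite pacing on top of the rotation
```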

**Update regarding Selenium:** I missed the fact that you're using Selenium; I took it for granted you were using some kind of modern programming language on its own (I know that Selenium can be driven from most widely used languages, but it can also run as a sort of browser plug-in that demands very little programming skill).

As I then presume your coding skills aren't (or weren't?) mind-boggling, and for others with the same limitations when using Selenium, my answer is to either learn a simple scripting language (PowerShell?!) or JavaScript (since it's the web you're on ;-)) and take it from there.

If automating scraping smoothly were as simple as a browser plug-in, the web would have to be a much messier, more obfuscated, and more credential-demanding place.

Morten Bergfall