I am trying to get information on a large number of scholarly articles (on the order of thousands) as part of my research study. Since Google Scholar does not have an API, I am trying to scrape/crawl Scholar. Now I know that this is technically against the EULA, but I am trying to be very polite and reasonable about it. I understand that Google doesn't allow bots, in order to keep traffic within reasonable limits. I started with a test batch of ~500 requests with a 1 s pause between each. I got blocked after about the first 100 requests. I tried multiple other strategies, including:
- Extending the pauses to ~20s and adding some random noise to them
- Making the pauses log-normally distributed (so that most pauses are on the order of seconds, but every now and then there is a longer pause of several minutes or more)
- Doing long pauses (several hours) between blocks of ~100 requests (a rough sketch of this pacing logic is below).
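
For concreteness, here is a minimal sketch of the pacing logic I'm describing (Python; `fetch_article` is just a placeholder for the actual Scholar request, and the log-normal parameters are illustrative, not tuned):

```python
import random
import time

def fetch_article(query):
    # Placeholder for the actual Scholar request
    # (e.g. an HTTP GET against the results page being scraped).
    raise NotImplementedError

def polite_crawl(queries, block_size=100, block_pause_hours=3):
    """Pacing only: log-normal pauses plus a long break after each block."""
    for i, query in enumerate(queries, start=1):
        fetch_article(query)

        if i % block_size == 0:
            # Long break (several hours) between blocks of ~100 requests.
            time.sleep(block_pause_hours * 3600)
        else:
            # Log-normal pause: median e^2 ≈ 7 s, with a heavy right tail
            # that occasionally produces pauses of several minutes.
            time.sleep(random.lognormvariate(2.0, 1.0))
```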
I doubt that at this point my script adds any considerable traffic over what a human would generate. But one way or another, I always get blocked after ~100-200 requests. Does anyone know of a good strategy to overcome this? (I don't care if it takes weeks, as long as it is automated.)

Also, does anyone have experience dealing with Google directly and asking for permission to do something similar (for research etc.)? Is it worth writing to them to explain what I'm trying to do and how, and seeing whether I can get permission for my project? And how would I go about contacting them? Thanks!