
I am trying to do some web scraping for a study project. Unfortunately, I need to scrape some data off Google Scholar, which blocks my requests. I have tried using multiple HTTP proxies, but my requests still get blocked after ~300 tries.

The resulting html from the blocked requests contains:

 IP address: 145.109...<br/>Time: 2016-05-05T09:23:37Z<br/>URL: 
 https://scholar.google.nl/citations?hl=en&amp;view_op=search_authors
 &amp;mauthors=Perry<br/>

The above IP is my own, while my proxies dict (it selects a proxy from a list at random) and get request look like this:

proxies = {'http': 'http://<username>:<password>@107.182....:<port>'}

result = requests.get('https://scholar.google.nl/citations?hl=en'
                      '&view_op=search_authors&mauthors=Perry',
                      proxies=proxies, headers=headers)

The proxy IPs are of course valid and working, and my own IP is not included in the proxy list. Am I doing something wrong?

Edit: For completeness, I have also tried setting authentication like this answer suggests, but the result is the same.

Truub
  • What is ``? If it's more entries with `http` as key, this is a dict, only one will be retained. And you're requesting a https url, so if you don't have a https entry in your proxies dict, no proxy will be used. – mata May 05 '16 at 10:12
  • Ah badly worded, I'll edit my question. The proxies are actually contained in a list and it selects one at random and adds that to the dict. But it being https and the proxy http solves the question. Could you maybe add it as an answer so I can select it? Quite stupid that I missed that -_-, thanks! – Truub May 05 '16 at 10:25

1 Answer


In your proxies dict, the URL scheme doesn't match the one you're using for your request: you have an http entry for your proxies but then make an https request. Since there is no https entry, no proxy is used and the request goes out from your own IP. If you add a proxy for the https scheme, it should work.
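As a rough illustration of the fix, here is a minimal sketch that maps both schemes to the same randomly chosen proxy. The proxy URLs, `PROXY_POOL`, `build_proxies` and `proxy_for` are hypothetical names made up for this example, and `proxy_for` is only a simplified model of picking a proxy by URL scheme, not the actual lookup code in requests. It also assumes your HTTP proxies can tunnel HTTPS traffic.

```python
import random
from urllib.parse import urlsplit

# Hypothetical placeholder proxies; substitute your real credentials, hosts and ports.
PROXY_POOL = [
    "http://user:secret@proxy-a.example:8080",
    "http://user:secret@proxy-b.example:8080",
]


def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy so https requests use it too."""
    return {"http": proxy_url, "https": proxy_url}


def proxy_for(url, proxies):
    """Simplified model of the lookup: the entry keyed by the URL's scheme."""
    return proxies.get(urlsplit(url).scheme)


url = ("https://scholar.google.nl/citations?hl=en"
       "&view_op=search_authors&mauthors=Perry")

# A dict with only an 'http' key has no entry for an https URL.
http_only = {"http": PROXY_POOL[0]}
print(proxy_for(url, http_only))

# With both keys set, the https URL resolves to the chosen proxy.
proxies = build_proxies(random.choice(PROXY_POOL))
print(proxy_for(url, proxies))
```

The resulting dict can then be passed to the request as in the question, e.g. `requests.get(url, proxies=proxies, headers=headers)`.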

mata